Part 11 — Tools and Technology Stack

“A practitioner’s stack is a working artefact, not a wishlist. Every tool you adopt is a tool you must maintain, defend, and justify under analytical scrutiny.”

This chapter is the operational tool reference for the independent intelligence analyst. It does not catalogue every tool that exists — that work is done elsewhere, and is obsolete within a quarter. It catalogues the tools that an independent practitioner can realistically deploy, justify against an OPSEC threat model, and sustain at solo-operator scale, organised by analytical function.

The audience is the working analyst building or rebuilding a personal stack: a journalist, a researcher, a small consultancy operator, a solo open-source intelligence practitioner. The frame is operational. Each tool entry answers four questions: what does it do, what is the cost-tier and what do you get for it, what does using it cost you in OPSEC terms, and when do you prefer it to its alternatives.

For the foundational tool-by-tool walkthrough at beginner level, see OSINT Toolkit Essentials. This chapter does not duplicate that material — it sits one layer above, treating tools as components of an analyst’s working architecture rather than as standalone capabilities.


1. Design Principles for an Independent Analyst’s Stack

The independent analyst designs a stack under constraints that institutional environments do not face. A government intelligence officer or a corporate threat-intelligence team operates inside a procured stack, with vendor relationships, dedicated infrastructure staff, classified data feeds, and a budget envelope that absorbs $50,000 per-seat enterprise licences without flinching. The independent practitioner has none of that. The independent practitioner has a laptop, a credit card, and a finite weekly maintenance budget measured in hours, not in full-time engineering staff.

Four design principles govern selection. Each is a constraint, and each constraint forces an architectural decision before any specific tool is named.

1.1 Open-source preferred

Tools with auditable source code are preferable to black-box commercial tools, particularly for OPSEC-sensitive collection work. The reasons are operational, not ideological:

  • Auditability. You can inspect what an open-source tool does on your machine. With a commercial closed-source tool, you trust the vendor’s privacy policy and their implementation of it.
  • No vendor lock-in. Commercial vendors discontinue products, pivot business models, raise prices, or get acquired and have their tooling absorbed into enterprise suites that no longer serve independent users (the RiskIQ acquisition by Microsoft is a textbook case). An open-source tool you can fork, archive, and self-host outlives its vendor.
  • No budget risk. A free-tier commercial tool can move behind a paywall on a quarterly earnings call. Your stack should not have single points of catastrophic failure tied to another company’s product decisions.
  • Better community knowledge. Open-source tools have public issue trackers, public documentation, public abuse reports. You learn how a tool actually behaves in the field, not how its marketing department describes its behaviour.

This is a preference, not an absolute rule. There are categories — high-quality satellite imagery, certain corporate registries, the better neural machine translation systems — where the open-source alternative is materially weaker than the commercial product, and the commercial product is the right operational choice. The principle says prefer open-source when capability is comparable, not refuse commercial tools categorically.

1.2 OPSEC-compatible

Every tool that requires connecting to a third-party server with your analytical activity creates an OPSEC surface. The tool’s vendor sees what you search, what entities you investigate, what subjects you find interesting. For the independent analyst, this is not theoretical:

  • A practitioner who runs Shodan searches for industrial control systems in a specific country has told Shodan that they are interested in industrial control systems in that country.
  • A practitioner who reverse-image-searches a face on a Russian search engine has handed that face, and their interest in it, to that search engine’s owner.
  • A practitioner who routes their email through a free webmail service has agreed that the provider may scan that mail for advertising or training data.

Selection should be evaluated against the threat model developed in Part 08. Where the analytical subject is low-sensitivity (open-source historical research, public-policy analysis), permissive tool selection is acceptable. Where the subject is high-sensitivity (live conflict attribution, threat-actor profiling, work touching state interests), prefer local processing, hardened collection environments, and tools with strong data-handling policies. As a general default at any sensitivity level, choose local processing wherever it is technically feasible without significant quality loss.

The independent analyst’s OPSEC threat model is not the threat model of a clandestine intelligence officer; it is the threat model of a professional whose research subject may attempt to identify, profile, or retaliate against them, and whose work has commercial and reputational value that should not leak to competitors or training-data scrapers.

1.3 Maintainable at solo scale

Tools that require significant ongoing maintenance or have frequent breaking changes create operational drag. Every tool you adopt costs you maintenance time: updates, configuration drift, dependency churn, integration with other tools in your stack. At solo scale, the maintenance cost is paid out of your analytical time.

The failure mode is the analyst who spends Tuesday updating their self-hosted instances, Wednesday debugging a broken integration, Thursday refactoring their scraper scripts after a platform change, and Friday writing the deliverable they were supposed to write all week. Maintenance is not analysis. A stack that maximises analytical throughput minimises tool-tending time.

Three operational heuristics follow:

  • Adopt slowly. Resist the impulse to add a new tool for each new analytical need. Stretch existing tools first.
  • Standardise where possible. One reference manager. One knowledge base. One translation tool. Parallel tools fragment your attention and your archive.
  • Retire ruthlessly. A tool you have not used in six months is consuming maintenance overhead for no analytical return. Remove it.

1.4 Cost-justified

For commercial tools, analytical value must justify cost. A simple test: would you pay this much for this tool if you were paying out of your own pocket every month? You are.

Many free-tier tools are adequate for independent practitioners. Free-tier Shodan, free-tier Censys, free Sentinel Hub, free OpenSanctions, free OCCRP Aleph, free SEC EDGAR — these are not constrained substitutes for premium tools; they are professional-grade resources that institutions also use. The premium tiers add scale (higher rate limits, larger result sets, historical depth), not capability.

The highest-cost tools — Palantir, Babel Street, Refinitiv/LSEG World-Check, LexisNexis Risk Solutions, the enterprise tier of Maltego — are not accessible at independent-practitioner price points (entry pricing is typically $30,000–$250,000 per year), and are not the appropriate tools for independent practice in any case. Their architecture assumes a team with shared workspaces, an institutional data agreement framework, and infrastructure to absorb the data they produce. The independent practitioner who somehow accessed Palantir Gotham would discover they have purchased a Ferrari to drive on a dirt road.


2. Stack by Functional Category

The functional taxonomy below maps the analyst’s workflow: collection → infrastructure investigation → corporate/financial investigation → geolocation → image/video verification → document processing → knowledge management → AI augmentation → OPSEC. Tools are organised by what they let you do, not by what category they belong to in a vendor catalogue.

2.1 Collection and Feed Monitoring

Tool | Cost tier | OPSEC | Best for
Inoreader Pro | Freemium ($9.99/mo Pro) | Server-side; provider sees feed list | High-volume RSS + rule-based routing
Feedly | Freemium | Server-side; provider sees feed list | AI-filtered feed summarisation
Telegram Desktop | Free | Telegram metadata visible | Manual channel monitoring
tgstat.com / Telemetr.io | Freemium | Server-side; web access | Channel analytics, cross-post mapping
Google Alerts | Free | Server-side; very weak privacy | Crude name monitoring
Talkwalker Alerts | Freemium | Server-side | Better than Google Alerts
F5Bot | Free | Server-side | Reddit keyword monitoring

Inoreader Pro is the operational default for serious RSS-based collection. It is the strongest filtering and rule-based routing engine in the public market; the Pro tier ($9.99/month) supports rules that route feed items by keyword, source pattern, or topic into separate streams, which is the only way to manage a high-volume cross-domain feed architecture without drowning. The free tier is adequate below 500 feeds. The API is exposed for programmatic integration into downstream pipelines. For the reference stack configuration, see Inoreader Pro — Collection Stack Configuration.
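To make the idea of rule-based routing concrete, the sketch below routes feed items into named streams by keyword or source pattern. It is a local illustration only, not Inoreader's implementation or API: the `Rule` class, the `route` function, the stream names, and the item fields (`title`, `summary`, `source`) are all hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class Rule:
    stream: str                # destination stream name (hypothetical)
    keywords: tuple = ()       # case-insensitive substrings matched in title/summary
    source_pattern: str = ""   # optional regex matched against the feed URL


def route(item, rules):
    """Return every stream whose rule matches the item, or ['unsorted']."""
    text = f"{item.get('title', '')} {item.get('summary', '')}".lower()
    source = item.get("source", "")
    matched = []
    for rule in rules:
        keyword_hit = any(k.lower() in text for k in rule.keywords)
        source_hit = bool(rule.source_pattern) and re.search(rule.source_pattern, source) is not None
        if keyword_hit or source_hit:
            matched.append(rule.stream)
    return matched or ["unsorted"]


rules = [
    Rule("sanctions", keywords=("OFAC", "sanction")),
    Rule("gov-feeds", source_pattern=r"\.gov/"),
]
item = {"title": "OFAC adds new designations", "source": "https://home.treasury.gov/rss"}
print(route(item, rules))
```

An item can land in more than one stream, which mirrors the cross-domain case described above: the same story may matter to two separate lines of work.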

Feedly is the principal alternative. Comparable free tier; the Leo AI feature in paid tiers performs adequate priority filtering and summarisation, which can compress the morning triage. UX is more polished than Inoreader’s. The trade-off is weaker rule-based routing — Feedly is opinionated about how it thinks you should consume feeds, where Inoreader is configurable.

Telegram monitoring is the unsolved problem of the open-source collection toolkit. There is no single dominant tool. Operational approaches:

  1. Manual monitoring via Telegram Desktop (free) — adequate for following 20–50 channels personally. Above that count, fatigue sets in and you start missing things.
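  2. Telegram API for programmatic monitoring (free, requires scripting): a user-account client on Telegram’s MTProto API (for example via the Telethon library) can read public channels and forward content to your own ingestion pipeline. Bots created through the Bot API cannot join channels on their own and see a channel only after an administrator adds them. Rate limits are non-trivial. Requires a phone number and self-hosted scripting.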
  3. tgstat.com and Telemetr.io (freemium) — channel analytics, cross-posting analysis, growth metrics, audience overlap. Best for understanding a network of channels, not for live monitoring of message content.
  4. Channel discovery — use existing curated monitoring lists (Ukrainian war-reporter aggregations, Middle East tracker networks, regional aggregator channels) rather than building a list from scratch.

Google Alerts is crude but free. False-positive rate is high; coverage is patchy; latency is unpredictable. Adequate as a supplementary tripwire for less-covered names or topics where you want to know if anything gets indexed, but not reliable as a primary monitoring tool. Mention.com and Talkwalker Alerts are materially more capable; Talkwalker has a usable free tier.

F5Bot is the underrated tool in this category — free keyword monitoring across Reddit (and historically other forums). For analytical subjects with active community discussion, F5Bot surfaces conversation that does not reach mainstream RSS feeds.

2.2 Network and Infrastructure Investigation (Cyber CTI)

This is the category where free-tier commercial tools and free open-source tools are jointly so strong that an independent practitioner can match institutional capability for routine infrastructure investigation work. See BGP Routing and DNS Infrastructure for the analytical background.

Tool | Cost tier | OPSEC | Best for
Shodan | Freemium ($49/mo small business) | Server-side; queries logged | Internet-exposed services, CVE search
Censys | Freemium (250 queries/mo free) | Server-side | Complement to Shodan; different scan methodology
GreyNoise | Freemium | Server-side | Distinguishing background scanning from targeted activity
RiskIQ Community / MS Defender TI | Free tier | Server-side | Passive DNS, certificate history, infrastructure pivot
crt.sh | Free | Public log search | Certificate transparency, subdomain enumeration
ViewDNS.info | Free | Server-side | WHOIS history (free tier)
DomainTools | Paid-required | Server-side | WHOIS history (comprehensive, expensive)
CIRCL pDNS | Free (registration) | Server-side | Passive DNS lookups
Farsight DNSDB | Paid-required | Server-side | Comprehensive passive DNS
VirusTotal | Freemium | Server-side; uploads are searchable | File and URL reputation, passive DNS
BGP.tools / BGPView | Free | Server-side queries | ASN, BGP routing analysis
RIPE NCC RIS | Free | Server-side queries | BGP routing data

Shodan is the foundational tool. The free tier (2 saved searches, 100 results per query, no historical data) is adequate for occasional infrastructure lookups; the small-business tier ($49/month) unlocks full search, historical data, more API calls, and is the right tier for any practitioner doing routine cyber-CTI work. Shodan’s CVE search — querying by CVE identifier to find exposed vulnerable services — is the single most operationally useful feature for vulnerability-context investigations.

Censys uses a different scanning methodology than Shodan and frequently captures hosts and services that Shodan does not, and vice versa. Censys Community provides 250 free queries per month. Censys ASM (the enterprise attack-surface-management product) is not accessible to independent practitioners and is not necessary for independent work. The operational pattern is: run both Shodan and Censys on infrastructure questions and merge results. Their coverage is non-redundant.
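The run-both-and-merge pattern can be sketched as a set comparison. The example assumes you have already exported each tool's results and reduced them to `(ip, port)` pairs; the `compare_coverage` helper and the sample data (documentation-range addresses) are hypothetical, and no Shodan or Censys API is called.

```python
def compare_coverage(shodan_hosts, censys_hosts):
    """Split (ip, port) observations into shared and source-unique groups."""
    shodan, censys = set(shodan_hosts), set(censys_hosts)
    return {
        "both": sorted(shodan & censys),
        "shodan_only": sorted(shodan - censys),
        "censys_only": sorted(censys - shodan),
    }


# Hypothetical exports, reduced to (ip, port) pairs.
shodan_hosts = [("192.0.2.10", 443), ("192.0.2.10", 22), ("192.0.2.44", 8080)]
censys_hosts = [("192.0.2.10", 443), ("192.0.2.77", 3389)]

coverage = compare_coverage(shodan_hosts, censys_hosts)
for group, hosts in coverage.items():
    print(group, hosts)
```

Keeping the source-unique groups separate, rather than collapsing everything into one list, preserves provenance for the sourcing notes in the final product.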

GreyNoise solves a specific analytical problem: distinguishing internet background-scanning noise from targeted activity directed at a specific host. If you find an IP scanning your target, GreyNoise tells you whether that IP is scanning the entire internet (in which case it is uninteresting) or only your target (in which case it warrants investigation). The Community tier provides limited API calls; the web visualiser at viz.greynoise.io is free for IP lookups.
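The triage decision described above can be written as a small function over a lookup result. This is a sketch under assumptions: the `noise` and `classification` keys are modelled on what the GreyNoise Community lookup is understood to return and should be checked against the current API documentation; the `triage` function and its labels are hypothetical.

```python
def triage(record):
    """Classify an IP lookup result into a follow-up priority label."""
    if record.get("noise"):
        # Seen scanning the internet at large: usually not specific to your target.
        label = record.get("classification", "unknown")
        return f"background-scanner ({label})"
    # Not observed in internet-wide scanning: worth a closer look.
    return "not-seen-scanning-internet-wide"


# Hypothetical lookup results (keys are assumptions, see above).
print(triage({"ip": "198.51.100.7", "noise": True, "classification": "benign"}))
print(triage({"ip": "198.51.100.9", "noise": False}))
```

Absence from the background-noise dataset is a reason to investigate further, not proof of targeting on its own.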

RiskIQ Community Edition (now part of Microsoft Defender Threat Intelligence) provides passive DNS, certificate history, WHOIS history, and infrastructure pivot — the four pillars of infrastructure investigation in a single interface. The free tier was historically generous; access patterns have shifted with the Microsoft acquisition and should be checked at time of use.

crt.sh is the certificate transparency log search engine. It is free, has no rate limits worth caring about, and is the canonical tool for subdomain enumeration via certificate logs. For any target domain, the first move is crt.sh?q=%.target.tld to enumerate every subdomain that has ever had a publicly logged certificate. Nothing else in the free toolkit is this productive on a per-query basis.
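A minimal sketch of that first move in Python, using only the standard library. It assumes crt.sh's JSON output mode (`output=json`) and the `name_value` field, which can hold several newline-separated names per certificate; the helper names are hypothetical. The network call lives in `fetch_subdomains`, so the parsing logic can be exercised offline.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def crtsh_url(domain):
    """Build the wildcard query URL; urlencode turns '%' into '%25'."""
    return "https://crt.sh/?" + urlencode({"q": f"%.{domain}", "output": "json"})


def extract_subdomains(records, domain):
    """Collect unique names under `domain` from crt.sh-style JSON records."""
    names = set()
    for rec in records:
        for name in rec.get("name_value", "").splitlines():
            name = name.strip().lower()
            if name.startswith("*."):
                name = name[2:]  # drop the wildcard label
            if name == domain or name.endswith("." + domain):
                names.add(name)
    return sorted(names)


def fetch_subdomains(domain, timeout=30):
    """Query crt.sh over the network and return the parsed subdomain list."""
    with urlopen(crtsh_url(domain), timeout=timeout) as resp:
        return extract_subdomains(json.load(resp), domain)


sample = [
    {"name_value": "www.example.com\n*.dev.example.com"},
    {"name_value": "EXAMPLE.com"},
    {"name_value": "evil-example.com"},  # not a subdomain, filtered out
]
print(extract_subdomains(sample, "example.com"))
```

The suffix check matters: a plain substring match would wrongly accept look-alike domains such as `evil-example.com`.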

Passive DNS (the historical record of which domains have resolved to which IPs over time) is critical for infrastructure investigation and is the area where free tooling is most constrained. CIRCL’s passive DNS service (free with registration) provides moderate historical depth. VirusTotal’s passive DNS data (free with a login) is broad but shallow. Farsight DNSDB is the gold-standard commercial source and is priced for institutions. For the independent practitioner, the operational pattern is to triangulate CIRCL + VirusTotal + RiskIQ for passive DNS, accept that you will sometimes miss historical resolutions, and document the limitation in your sourcing.
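Triangulation across sources amounts to collapsing overlapping sightings into one record per domain-to-IP pair while remembering who reported what. The sketch below assumes each source's results have been normalised by hand into dictionaries with `domain`, `ip`, `first_seen`, `last_seen`, and `source` keys; that schema and the `merge_pdns` helper are hypothetical, not any provider's format.

```python
from datetime import date


def merge_pdns(observations):
    """Collapse per-source sightings into one record per (domain, ip) pair."""
    merged = {}
    for obs in observations:
        key = (obs["domain"], obs["ip"])
        rec = merged.setdefault(
            key,
            {"first_seen": obs["first_seen"], "last_seen": obs["last_seen"], "sources": set()},
        )
        rec["first_seen"] = min(rec["first_seen"], obs["first_seen"])
        rec["last_seen"] = max(rec["last_seen"], obs["last_seen"])
        rec["sources"].add(obs["source"])
    return merged


# Hypothetical, manually normalised observations.
observations = [
    {"domain": "example.com", "ip": "192.0.2.1", "first_seen": date(2022, 3, 1),
     "last_seen": date(2022, 9, 1), "source": "CIRCL"},
    {"domain": "example.com", "ip": "192.0.2.1", "first_seen": date(2021, 11, 5),
     "last_seen": date(2022, 6, 1), "source": "VirusTotal"},
    {"domain": "example.com", "ip": "198.51.100.4", "first_seen": date(2023, 1, 10),
     "last_seen": date(2023, 1, 12), "source": "VirusTotal"},
]
merged = merge_pdns(observations)
for key, rec in merged.items():
    print(key, rec["first_seen"], rec["last_seen"], sorted(rec["sources"]))
```

A resolution reported by a single source is exactly the kind of limitation worth recording in your sourcing.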

BGP analysis — BGP.tools, BGPView, RIPE NCC RIS — is essential for attributing infrastructure to organisations and jurisdictions. Knowing which ASN announces a prefix, who owns that ASN, where it peers, and how routing has changed over time is the bridge between “this IP is hosting X” and “X is operated from jurisdiction Y by organisation Z”. All three sources are free; their coverage is highly complementary.

2.3 Corporate and Financial Investigation

Tool | Cost tier | OPSEC | Best for
OpenSanctions | Free (API: free with registration) | Server-side | Sanctions screening, 180+ source aggregation
OpenCorporates | Freemium | Server-side | Global corporate registry search
ICIJ Offshore Leaks DB | Free | Server-side | Panama, Pandora, Paradise, Bahamas papers
OpenOwnership / BODS Register | Free | Server-side | Beneficial ownership
OCCRP Aleph | Free (registration) | Server-side | Investigative documents, oligarch research
SEC EDGAR / EFTS | Free | Server-side | US public company filings, full-text search
PACER | Paid-per-page | Server-side | US federal court records (primary)
CourtListener / RECAP | Free | Server-side | US federal court records (mirror)
USASpending.gov | Free | Server-side | US federal contracts, grants, loans
TED.europa.eu | Free | Server-side | EU public procurement

OpenSanctions is the most operationally useful free corporate tool in existence. It aggregates 180+ sanctions and political-exposure sources (OFAC SDN, EU Consolidated, UN, UK, Swiss SECO, Ukrainian SBU lists, Canadian SEMA, Australia DFAT, and dozens of national lists) into a single searchable database, updated daily. The API is free with registration. Every corporate investigation begins here. The data quality is good enough that institutional sanctions-screening vendors increasingly use OpenSanctions as one of their feeds.

OpenCorporates is the largest open corporate registry aggregator, covering 150+ jurisdictions. The free tier provides web search with rate limits; bulk and API access is paid. For independent practice, the web tier is usually sufficient. The structural value is being able to search for officers and directors across jurisdictions in a single query, which catches cross-border corporate networks that a per-jurisdiction search would miss.

ICIJ Offshore Leaks Database is the front-end to the Panama Papers, Pandora Papers, Paradise Papers, Bahamas Leaks, and the original 2013 Offshore Leaks. Free search and download. Essential for any investigation touching offshore structures, beneficial-ownership obfuscation, or wealth-concealment networks. The data is dated (the leaks are from specific years) but the entity and relationship graph remains analytically valuable for understanding network structure.

OpenOwnership aggregates beneficial-ownership data from national registries with growing global coverage. The UK PSC register, Denmark, Norway, and Ukraine (via OpenDataBot) are the most useful jurisdictions inside the OpenOwnership ecosystem. Beneficial-ownership registers in most other countries remain either non-existent, paywalled, or restricted to authorised users.

OCCRP Aleph is the investigative-journalism document and entity database operated by the Organized Crime and Corruption Reporting Project. Free for individual researchers with registration. Coverage is strongest for oligarch research, Eastern Europe, South America, and corruption networks. The corpus combines court records, corporate filings, media reports, leaks, and contributed datasets. Treat Aleph as the equivalent of an institutional analyst’s case-management system: a place to search across heterogeneous document collections and find the entity-of-interest references that no single source surfaces.

SEC EDGAR is the US Securities and Exchange Commission’s filing system. It is free and exhaustive for US public-company filings. The EDGAR Full-Text Search (efts.sec.gov), added in 2022, is dramatically more useful than the legacy filing-by-filing browse interface — you can now search the text of every 10-K, 10-Q, 8-K, proxy statement, and registration statement in a single query. For any investigation touching US public companies, EDGAR full-text search is the first stop.

PACER and CourtListener together cover US federal court records. PACER (the official Public Access to Court Electronic Records system) charges per-page fees; the fees are modest individually but cumulative across a large case. CourtListener, operated by the Free Law Project, hosts the RECAP Archive, which mirrors documents that researchers have purchased from PACER and makes them available for free. The operational pattern is to check CourtListener first for any document, and only buy from PACER when CourtListener does not have it. For state court records, Trellis is the principal commercial aggregator, and state-specific portals exist with widely varying quality and access models.

USASpending.gov provides full-text search of US federal contracts, grants, and loans. Free. Essential for tracing government-contractor relationships, identifying which entities receive federal funds, and mapping contract-network structure. For EU public procurement, TED.europa.eu (Tenders Electronic Daily) is the equivalent free resource.

2.4 Geolocation and Satellite Imagery

Tool | Cost tier | OPSEC | Best for
SunCalc.net | Free | Server-side query | Shadow-angle geolocation, temporal verification
Sentinel Hub EO Browser | Free | Server-side | Copernicus Sentinel-2 (10m, 5-day revisit)
Google Earth Pro | Free | Server-side | Historical imagery, terrain, measurement
Planet Labs | Paid-required (academic via NICFI) | Server-side | Daily-revisit commercial (3m)
Maxar Open Data | Free (event-specific releases) | Server-side | 30cm imagery for major events
Google Maps / Bing Maps | Free | Server-side | Comparative imagery, different revisits
What3words | Free | Server-side | Imprecise-location resolution

SunCalc.net is the irreplaceable geolocation tool. Enter a coordinate and a datetime; it returns sun azimuth and elevation. Shadow-angle analysis on a photograph — measuring the bearing and length of a shadow against a known object’s height — yields a sun-azimuth/elevation pair, which constrains the photograph’s location and time to a narrow band. SunCalc is the lookup that turns the measurement into a verification.
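The measurement step reduces to two lines of trigonometry: on level ground, sun elevation is the arctangent of object height over shadow length, and the sun's azimuth is the direction opposite to the shadow's bearing. The sketch below works that arithmetic; the function name and sample numbers are illustrative, and it assumes flat terrain and a vertical object.

```python
import math


def sun_from_shadow(object_height_m, shadow_length_m, shadow_bearing_deg):
    """Return (sun_azimuth_deg, sun_elevation_deg) implied by a measured shadow.

    Assumes level ground and a vertical object. Bearings are degrees
    clockwise from north; the shadow points directly away from the sun.
    """
    elevation = math.degrees(math.atan2(object_height_m, shadow_length_m))
    azimuth = (shadow_bearing_deg + 180.0) % 360.0
    return azimuth, elevation


# A 2 m post casting a 2 m shadow that points to bearing 030.
azimuth, elevation = sun_from_shadow(2.0, 2.0, 30.0)
print(f"sun azimuth ~ {azimuth:.1f} deg, elevation ~ {elevation:.1f} deg")
```

The resulting pair is what you then compare against SunCalc's values for candidate coordinates and times; measurement error in shadow length or bearing widens the band accordingly.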

Sentinel Hub EO Browser is the free alternative to commercial satellite imagery. It provides browser access to the European Space Agency’s Copernicus Sentinel-2 imagery (10-metre resolution, 5-day revisit at the equator, free under Copernicus open-data policy). True-colour, false-colour, and specialised band combinations are available. The 10-metre resolution is inadequate for identifying individual vehicles or persons but is fully adequate for confirming the presence of constructed infrastructure, troop concentrations, damage extent, and large-scale ground change. For independent conflict OSINT and geopolitical analysis where commercial imagery is unavailable, Sentinel-2 is the analytical baseline.

Google Earth Pro (free since 2015) is the historical-imagery workhorse. Its archive of historical imagery in specific areas often spans a decade or more, which is sometimes irreplaceable for before-and-after analysis. The 3D terrain, measurement tools, and overlay capability make it the standard tool for terrain matching during geolocation work.

Planet Labs is the commercial standard for high-cadence imagery: daily revisit, 3-metre resolution, global coverage. No meaningful free tier exists for operational use. Academic and NGO access is available through the NICFI (Norway’s International Climate and Forests Initiative) programme, which is restricted to tropical-forest research. The commercial cost is significant: typical institutional licences run into five figures annually. Independent practitioners who need Planet imagery for specific work occasionally collaborate with an institutional partner who has access; this is the only realistic path.

Maxar Open Data Program releases high-resolution imagery (30-centimetre) for significant events — major disasters, escalating conflicts, attention-grabbing geopolitical incidents — on an ad-hoc basis. It is not a routine monitoring tool, but the quality when available is exceptional and the licence is permissive for analytical use.

Google Maps and Bing Maps use different imagery sources, captured on different dates. Comparing them — and comparing both with Google Earth’s archive — frequently reveals additional resolution or different temporal coverage. Bing has periodically shown higher-resolution imagery than Google in specific areas, particularly in parts of Eastern Europe and Central Asia.

What3words is a geocoding system that converts coordinates into three-word identifiers. The analytical utility is the reverse: resolving imprecise location descriptions in social media (where a witness says “the explosion was near the bridge by the market”) into a checkable coordinate. Not a collection tool, but a useful translation layer between human and machine geographic vocabulary.

2.5 Image and Video Verification

Tool | Cost tier | OPSEC | Best for
InVID/WeVerify | Free (browser extension) | Server-side | Video keyframe extraction, multi-engine reverse search
Google Reverse Image | Free | Server-side; query logged | Reverse image: global content
Yandex Images | Free | Server-side; Russian jurisdiction | Reverse image: Russian, Eastern European, faces
Bing Visual Search | Free | Server-side | Reverse image: additional coverage
TinEye | Freemium | Server-side | Reverse image: exact-match focused
FotoForensics | Free | Server-side; upload | JPEG manipulation detection (ELA)
ExifTool | Free (CLI, local) | Local | Comprehensive metadata extraction
yt-dlp | Free (CLI, local) | Local on machine; server-side for download | Video preservation, 1,000+ site support

InVID/WeVerify is the verification workhorse — a browser extension developed by an EU research consortium specifically for journalistic and analytical verification. Keyframe extraction from video, simultaneous reverse-image search across Google/Yandex/Bing/TinEye/Baidu, EXIF and metadata viewing, image magnification, video forensic indicators. Free. There is no realistic substitute. Every analyst doing image or video verification should have InVID/WeVerify installed.

Reverse image search is a multi-engine practice, not a single-engine practice. The engines have meaningfully different indices:

  • Yandex significantly outperforms all others for Russian, Ukrainian, and broader Eastern European content, and for face matching in general. The strong face-matching capability is operationally valuable and ethically fraught — see Part 10 for ethical constraints.
  • Google is adequate for global English-language content and has the largest overall index for general images.
  • Bing Visual Search sometimes produces results the others miss, particularly for commercial imagery and certain regions.
  • TinEye is exact-match focused — useful when you need to know precisely where the same image (not similar images) has been published.

The operational pattern is to run all four (InVID/WeVerify automates this) and merge results.

FotoForensics (fotoforensics.com) provides Error Level Analysis (ELA) for JPEG manipulation detection. ELA is not a definitive forensic test — it identifies areas of an image with different compression histories, which often correlates with editing but can also occur for benign reasons (saved-twice, different software). Treat ELA results as a prompt for closer examination, not as proof of manipulation.

ExifTool is the comprehensive metadata extraction tool, written by Phil Harvey, free and open-source, command-line. It handles essentially any file format and extracts every metadata field present. The standard for metadata forensics. Available in most Linux package managers; the canonical reference distribution is the author’s website. Run on any image or document you receive before analysing it.
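For routine intake it can help to wrap ExifTool in a few lines of Python so that every received file gets the same treatment. The sketch below assumes `exiftool` is installed and on the PATH, and uses its `-json` and `-G` options (JSON output, tags prefixed with their group name); the wrapper function names are hypothetical.

```python
import json
import subprocess


def exiftool_command(path):
    """Build the argument list: JSON output, tags prefixed with group names."""
    return ["exiftool", "-json", "-G", str(path)]


def parse_exiftool_output(stdout):
    """ExifTool emits a JSON array with one object per input file."""
    records = json.loads(stdout)
    return records[0] if records else {}


def read_metadata(path):
    """Run ExifTool on one file and return its metadata as a dict."""
    result = subprocess.run(
        exiftool_command(path), capture_output=True, text=True, check=True
    )
    return parse_exiftool_output(result.stdout)


# Offline illustration with a hand-written sample of the output shape.
sample_stdout = '[{"SourceFile": "photo.jpg", "EXIF:Make": "ExampleCam"}]'
print(parse_exiftool_output(sample_stdout))
```

Passing the command as a list rather than a shell string avoids quoting problems with filenames received from untrusted sources.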

yt-dlp is the open-source successor to youtube-dl, supporting downloads from 1,000+ video sites. Free. Essential for preserving video evidence before deletion — and given the rate at which sensitive video is removed by platforms or by uploaders under pressure, preservation discipline is not optional. Pair with a hash-and-archive workflow (compute SHA-256 on download, store hash and timestamp in your case notes) for chain-of-custody-grade preservation.

2.6 Document Analysis and Language

Tool | Cost tier | OPSEC | Best for
DeepL | Freemium (free: 500k chars/mo) | Server-side | High-quality European-language translation
Google Translate | Free | Server-side | Broad-language gisting (AR/FA/ZH/JA/KO)
pdfminer.six / pdfplumber | Free (Python, local) | Local | PDF text and table extraction (programmatic)
Tesseract OCR | Free (CLI, local) | Local | OCR, 100+ languages
GROBID | Free (self-hosted) | Local | Academic-paper structured extraction
Camelot / Tabula | Free (local) | Local | PDF table extraction to structured data

DeepL is the best neural machine translation available for European languages. German, French, Spanish, Italian, Portuguese, Russian, Polish — DeepL’s output is materially better than Google Translate’s for these languages, and the gap matters when analytical decisions depend on precise reading of source material. Free tier covers 500,000 characters per month; DeepL Pro ($9.99/month) removes the limit. For an independent analyst doing serious multilingual collection, DeepL Pro is one of the highest-return paid subscriptions in the stack.

Google Translate remains the better option for Arabic, Persian, Chinese, Japanese, and Korean, where DeepL coverage is either absent or weaker. Free. Use it for gisting (understanding the substance of source material) rather than for quoting (citing language verbatim). For quotation, find a human translator or, at minimum, get a second machine-translation pass and a native-speaker spot-check.

The multilingual collection rule from Part 03 — that you read primary state-actor sources in their native language whenever possible — depends entirely on having competent translation tooling.

Programmatic document processing (pdfminer.six, pdfplumber, Tesseract, GROBID, Camelot, Tabula) is the toolkit for analysts who routinely process large document collections. The case for learning enough Python to run these tools — even at the level of adapting example scripts — is strong: institutional analysts have research-engineering teams that do this work; the independent analyst either does it themselves or does not do it at all.

  • pdfminer.six / pdfplumber for text extraction from PDFs.
  • Tesseract OCR (paired with Poppler / pdf2image for PDF page rendering) for scanned documents.
  • GROBID for structured extraction from academic papers — title, authors, abstract, sections, references — into clean structured records. Self-hosted; one-time setup; high return for any analyst processing literature.
  • Camelot and Tabula for PDF table extraction. Regulatory filings, financial statements, official reports, and most government documents contain tables that are useless until extracted into machine-readable form. These tools do that extraction.
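As an indication of how little scripting the entry level requires, the sketch below pulls text and tables from a PDF with pdfplumber and serialises a table to CSV. It assumes pdfplumber is installed (`pip install pdfplumber`) and uses its `open`, `pages`, `extract_text`, and `extract_tables` calls; the wrapper function names are hypothetical, and scanned PDFs would need an OCR pass first.

```python
import csv
import io


def extract_pdf(path):
    """Return per-page text and tables from a text-based (non-scanned) PDF."""
    import pdfplumber  # third-party; imported here so table_to_csv works without it

    pages = []
    with pdfplumber.open(path) as pdf:
        for number, page in enumerate(pdf.pages, start=1):
            pages.append({
                "page": number,
                "text": page.extract_text() or "",
                "tables": page.extract_tables(),  # list of tables, each a list of rows
            })
    return pages


def table_to_csv(table):
    """Serialise one extracted table (rows of cells, possibly None) to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in table:
        writer.writerow(["" if cell is None else cell for cell in row])
    return buffer.getvalue()


# Offline illustration with a hand-written table in the extracted shape.
print(table_to_csv([["Entity", "Amount"], ["Example Ltd", None]]))
```

Empty cells come back as `None`, so normalising them before export avoids the literal string "None" leaking into downstream spreadsheets.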

2.7 Knowledge Management and Research Organisation

Tool | Cost tier | OPSEC | Best for
Obsidian | Free (personal) / paid Sync optional | Local-first | Analyst’s vault-as-intelligence-system
Zotero | Free | Local + optional sync | Reference management, PDF annotation
Joplin | Free | Local + E2E-encrypted sync | Simpler alternative to Obsidian

Obsidian is the reference standard for the independent analyst’s knowledge management foundation. Markdown-based (so the underlying files are plain text, future-proof, and grep-able); local-first (no cloud lock-in; your vault lives on your filesystem); graph visualisation; bidirectional backlinks; tag hierarchies; an extensive plugin ecosystem. Free for personal use. The vault-as-intelligence-system pattern — where the vault is not a notes-app but a working knowledge architecture organised by analytical function, with frontmatter-driven metadata, wikilinked entity references, and disciplined section structure — is the foundation on which the rest of the independent practice is built. See Obsidian for Intelligence Analysis for the operational architecture.
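Because a vault is just Markdown files, entity notes with frontmatter and wikilinks can be generated by script. The sketch below shows one possible shape; the frontmatter fields (`type`, `sources`), the section heading, and the `entity_note` helper are hypothetical conventions chosen for illustration, not a schema prescribed by Obsidian or by this guide.

```python
def entity_note(name, entity_type, sources, related):
    """Render a Markdown note with YAML frontmatter and wikilinked entities."""
    lines = ["---", f"type: {entity_type}", "sources:"]
    lines += [f"  - {s}" for s in sources]
    lines += ["---", "", f"# {name}", "", "## Related entities", ""]
    lines += [f"- [[{r}]]" for r in related]
    return "\n".join(lines) + "\n"


note = entity_note(
    name="Example Holdings Ltd",
    entity_type="organisation",
    sources=["OpenCorporates", "OpenSanctions"],
    related=["Jane Doe", "Example Shipping LLC"],
)
print(note)
```

Writing the returned string to `Example Holdings Ltd.md` inside the vault is enough for the backlinks and graph view to pick it up.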

Zotero is the reference manager. PDF annotation, automatic bibliographic metadata capture (via the Zotero Connector browser extension), citation export in any major style, shared-library support, full-text search across the collection. Free and open-source. The complement to Obsidian: Zotero holds the documents and their formal bibliographic records; Obsidian holds your analytical work, with wikilinks into the Zotero corpus where citations matter.

Joplin is the open-source, end-to-end-encrypted note-taking alternative for analysts who want strong encryption-at-sync without Obsidian’s graph and plugin complexity. Free. Adequate for sensitive operational notes that should not live in a typical cloud notes service. The trade-off is feature minimalism — Joplin is not a vault-as-intelligence-system; it is a notes app with good encryption.

2.8 AI Augmentation for the Independent Analyst

The realistic assessment of LLM utility under independent-analyst constraints is documented in LLM-OSINT-SOP-A2IC. The summary below is for tool-selection purposes.

What LLMs genuinely improve for the solo independent analyst:

  • Translation quality and domain-specific terminology. LLMs handle technical military, legal, and financial vocabulary better than general MT engines, particularly for low-frequency terms-of-art that DeepL and Google Translate render literally.
  • Bulk processing of unstructured text. Telegram channel archive analysis, document-collection summarisation, multi-document entity extraction — tasks that scale linearly with text volume and that no single analyst can do by hand in reasonable time.
  • Adversarial review. The two-AI protocol from Part 06 — using one model to draft and a separate model to critique — substitutes (imperfectly) for the institutional red-team function.
  • Report drafting scaffolding. BLUF, key judgements structure, executive summaries, alternative-hypothesis framing — LLMs produce serviceable first drafts of structured analytical product that the analyst then revises and tightens.
  • Entity extraction from large document sets. Pulling person/organisation/place references out of a corpus into a structured list, suitable for downstream graph-construction or for sanctions cross-checking.

What LLMs do not replace:

  • Source verification. LLMs confabulate. Verification requires primary-source cross-check. Any citation generated by an LLM is presumed-fabricated until you have personally verified the source exists, says what the LLM claims, and is the authority the LLM implies.
  • Geolocation. LLMs cannot reliably interpret satellite imagery, perform shadow-angle analysis, or read terrain. Multimodal models perform poorly on this category and overstate their confidence.
  • Attribution. LLMs do not have access to current threat intelligence. Their pre-training data is temporally stale (months to years behind), and they have no live signal on actor infrastructure, current campaigns, or recent reattributions.
  • Confidence calibration. LLMs systematically overstate confidence. Output that reads as “highly probable” should be treated as a hypothesis, not a finding.
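The source-verification rule above can be made mechanical with a small record per citation. This is a sketch only: the class and field names are hypothetical, and the three flags mirror the three checks named in the verification bullet (the source exists, it says what was claimed, it is the authority implied).

```python
from dataclasses import dataclass


@dataclass
class CitationCheck:
    claim: str
    cited_source: str
    source_exists: bool = False         # you located the primary source yourself
    says_what_claimed: bool = False     # you read it and the text supports the claim
    authority_as_implied: bool = False  # the source is the authority the model implied

    @property
    def status(self) -> str:
        """A citation stays presumed-fabricated until all three checks pass."""
        verified = (
            self.source_exists
            and self.says_what_claimed
            and self.authority_as_implied
        )
        return "verified" if verified else "presumed-fabricated"


# Hypothetical example: nothing has been checked yet.
check = CitationCheck(claim="Example claim", cited_source="Example report, 2023")
print(check.status)

check.source_exists = True
check.says_what_claimed = True
check.authority_as_implied = True
print(check.status)
```

The default of `False` on every flag is the point of the design: a citation has to earn its way out of the presumed-fabricated state.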

Tool selection for AI augmentation:

| Option | OPSEC posture | Hardware | Cost |
|---|---|---|---|
| Claude (Anthropic API, ZDR endpoint) | Strong (zero data retention if enabled at org level) | None special | Pay-per-token, typically $20–100/mo |
| Gemini Advanced (Google Vertex AI, ZDR) | Strong | None special | Subscription or pay-per-token |
| Local: Ollama with Llama-3-70B / Mistral Large | Strongest | High-end GPU or large RAM | Hardware-only after acquisition |
| Local: smaller models (Llama-3-8B, etc.) | Strongest | Consumer GPU | Hardware-only |

The trade-off is straightforward: local deployment offers the strongest OPSEC posture but requires hardware investment and accepts a meaningful capability gap relative to frontier hosted models. At 16-bit precision, Llama-3-70B needs roughly 140 GB of memory for its weights alone (70 billion parameters at 2 bytes each), and Mistral Large is larger still; it is the 4-bit quantised builds that land in the 40 GB range, which is still workstation-class hardware. Quantisation trades memory for quality, and the degradation varies by task. For analysts on consumer hardware, Claude API with zero data retention (configured at the organisation level) is the best balance of capability, OPSEC, and cost.
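The hardware requirement can be estimated from first principles: weight memory is approximately parameter count times bytes per weight. The sketch below covers weights only; context cache and runtime overhead come on top, so treat the figures as a floor. The function name is illustrative.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate memory for model weights only, in GB (10^9 bytes)."""
    return params_billion * bits_per_weight / 8


for name, params in [("Llama-3-8B", 8), ("Llama-3-70B", 70)]:
    for bits in (16, 8, 4):
        print(f"{name} at {bits}-bit: ~{weight_memory_gb(params, bits):.0f} GB")
```

A 70-billion-parameter model comes out at about 140 GB at 16-bit and about 35 GB at 4-bit, while an 8-billion-parameter model at 4-bit needs about 4 GB. That is why the smaller models fit on a consumer GPU.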

Budget realism: at typical independent-analyst usage volume (say, 2–3 investigations per month plus a weekly newsletter), Claude API costs run $30–50 per month at current pricing. A heavy bulk-processing analyst could reach $100–200 per month. This is an order of magnitude cheaper than any institutional analytics platform and within the operational budget of any serious commercial independent practice.
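A budget figure of this kind is easy to re-derive for your own workload. The sketch below is an assumption-laden estimate: the per-million-token prices and the token volumes are hypothetical placeholders, not vendor pricing, and should be replaced with the current published rates and your own measured usage.

```python
def monthly_cost_usd(jobs, usd_per_m_input: float, usd_per_m_output: float) -> float:
    """Estimate monthly spend.

    jobs: iterable of (count_per_month, input_tokens_each, output_tokens_each).
    Prices are per million tokens.
    """
    total_in = sum(n * tokens_in for n, tokens_in, _ in jobs)
    total_out = sum(n * tokens_out for n, _, tokens_out in jobs)
    return total_in / 1e6 * usd_per_m_input + total_out / 1e6 * usd_per_m_output


# Hypothetical workload: three investigations and four newsletter issues per month.
WORKLOAD = [
    (3, 2_000_000, 300_000),  # investigations
    (4, 200_000, 50_000),     # weekly newsletter
]

# Hypothetical prices; check the vendor's pricing page before relying on them.
estimate = monthly_cost_usd(WORKLOAD, usd_per_m_input=3.0, usd_per_m_output=15.0)
print(f"~${estimate:.2f} per month")
```

Re-running the estimate whenever prices or workload change keeps the budget line honest.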

2.9 OPSEC Toolchain

| Tool | Cost tier | Purpose |
|---|---|---|
| Signal | Free | Encrypted messaging, source comms |
| Proton Suite (Mail / Drive / VPN) | Freemium ($9.99/mo paid) | E2E-encrypted mail/storage/VPN |
| KeePassXC | Free | Local-only password manager |
| YubiKey | Paid ($25–55 one-time) | Hardware 2FA |
| Tailscale | Free (personal: 3 users / 100 devices) | WireGuard mesh for personal infra |
| Librewolf / Firefox + arkenfox | Free | Privacy-hardened collection browser |
| Qubes OS | Free (high learning curve) | VM-compartmentalised OS |

Signal is the standard for encrypted messaging and source communications. Free, open-source, operated by the Signal Foundation (a non-profit). Use the desktop and mobile clients together (linked via QR code) so that you can handle long-form source material from the keyboard rather than from a phone. Signal’s metadata posture is the strongest in the consumer messaging ecosystem; treat it as the default for anything sensitive.

Proton Suite — ProtonMail, Proton Drive, ProtonVPN — provides end-to-end-encrypted email, cloud storage, and VPN in a single Swiss-jurisdictioned ecosystem. Mail free tier is adequate for basic use; paid plans ($9.99/month) add storage, custom-domain support, and more aliases. ProtonVPN’s free tier is genuinely usable — no bandwidth limits, limited server selection — which makes it one of the few free VPNs an analyst can actually rely on. The Proton ecosystem is not the only way to achieve these capabilities, but the integration across services is operationally convenient and the jurisdiction is favourable.

KeePassXC is the OPSEC-optimal password manager: free, open-source, local-only, no cloud sync, no vendor dependency. The encrypted database file is yours; sync between devices is achieved by syncing the file via Syncthing or Proton Drive if you want multi-device access, or by accepting single-device discipline. The trade-off relative to a cloud password manager (1Password, Bitwarden) is convenience: KeePassXC requires you to handle your own backup and sync, whereas cloud managers do it for you while taking on the OPSEC cost of holding your full credential set.

YubiKey is the hardware security key for 2FA. $25–55 depending on model. One-time cost. The operational benefit is removal of the SIM-swap and SMS-2FA attack vector for high-value accounts, provided SMS and other weaker fallback methods are disabled once the key is enrolled: an attacker who controls the phone number associated with the account can no longer pass the second factor without physical possession of the key. For any account whose compromise would be analytically or commercially significant, YubiKey enrolment is non-negotiable. Buy at least two keys (a primary and a backup) so that key loss does not lock you out.

Tailscale is zero-configuration WireGuard mesh networking for connecting personal devices and self-hosted infrastructure. Free for personal use up to 3 users and 100 devices. For analysts running self-hosted tools — Obsidian sync, local AI inference, private databases, personal n8n automation — Tailscale enables secure remote access between devices without exposing services to the public internet. The architectural value is that your private analytical infrastructure stays private: it is reachable from your laptop, your phone, and your home server, but not from the internet at large.

Librewolf (a privacy-hardened Firefox fork) or Firefox with the arkenfox user.js provides a collection-browsing environment separate from your public/personal browsing identity. Free, open-source. The operational separation matters: your collection browser should not share fingerprint, cookies, or session state with your personal browsing. Sandbox the collection identity.

Qubes OS is the compartmentalisation operating system: every application runs in an isolated Xen virtual machine, and you organise work into colour-coded VMs (a “work” VM, a “personal” VM, a “research” VM, a “vault” VM) with controlled inter-VM file transfer. The strongest desktop OPSEC architecture available. The learning curve is high; the hardware overhead is real (Qubes wants 16+ GB RAM and a fast SSD). Qubes is appropriate for analysts who face persistent adversarial interest — investigative journalists working on organised crime or state-actor subjects, for example. For most independent practitioners working on lower-sensitivity subjects, Qubes is overkill, and the maintenance burden makes it operationally counter-productive.


3. Tool Selection Decision Framework

When a tool is needed for a category, the selection decision is a function of seven variables. The matrix below makes the trade-offs explicit.

3.1 Decision criteria

| Criterion | What it captures |
|---|---|
| Open-source vs. proprietary | Auditability, lock-in risk, fork-ability |
| Cost tier | Free / freemium / paid-required / enterprise-only |
| OPSEC risk | Server-side processing vs. local; vendor data handling; jurisdiction |
| Coverage / quality | Index size, source breadth, output fidelity |
| Maintenance burden | Update cadence, breaking changes, integration cost |
| Community support | Documentation, issue trackers, peer expertise |
| Alternative available | Substitutability: is there a fallback? |

The principles from Section 1 set the weights: open-source is preferred on auditability and lock-in grounds; OPSEC risk is weighted up where collection touches sensitive subjects; maintenance burden is weighted up at solo scale; coverage is weighted up where institutional alternatives would obviously win on capability but are out of reach.
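One way to make the weighting explicit is a simple weighted average over the seven criteria. This is a sketch, not a prescribed method: the 1–5 scoring scale, the weights, and the two candidate tools are all hypothetical, chosen to show how heavier weights on OPSEC and maintenance can outrank better coverage.

```python
CRITERIA = (
    "open_source", "cost", "opsec", "coverage",
    "maintenance", "community", "substitutability",
)


def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-criterion scores (higher is better on every axis)."""
    missing = [c for c in CRITERIA if c not in scores or c not in weights]
    if missing:
        raise ValueError(f"missing criteria: {missing}")
    total_weight = sum(weights[c] for c in CRITERIA)
    return sum(scores[c] * weights[c] for c in CRITERIA) / total_weight


# Hypothetical weights for a solo analyst on sensitive subjects:
# OPSEC and maintenance burden count triple, open-source double.
WEIGHTS = {
    "open_source": 2, "cost": 1, "opsec": 3, "coverage": 1,
    "maintenance": 3, "community": 1, "substitutability": 1,
}

# Hypothetical candidates scored 1 (worst) to 5 (best).
candidate_a = {
    "open_source": 5, "cost": 5, "opsec": 5, "coverage": 3,
    "maintenance": 2, "community": 3, "substitutability": 4,
}
candidate_b = {
    "open_source": 1, "cost": 3, "opsec": 3, "coverage": 5,
    "maintenance": 5, "community": 4, "substitutability": 3,
}

score_a = weighted_score(candidate_a, WEIGHTS)
score_b = weighted_score(candidate_b, WEIGHTS)
print(f"candidate_a: {score_a:.2f}  candidate_b: {score_b:.2f}")
```

The number matters less than the act of writing the weights down: a substitution then has to be argued against stated priorities instead of habit.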

3.2 Reference selection matrix

| Category | Recommended tool | Open-source / proprietary | Cost | OPSEC | Best for | Alternative |
|---|---|---|---|---|---|---|
| Feed aggregation | Inoreader Pro | Proprietary | $9.99/mo | Moderate | Rule-based routing | Feedly |
| Telegram monitoring | Telegram Desktop + tgstat | Mixed | Free / freemium | Moderate | Channel coverage | Bot API self-host |
| Internet device scan | Shodan | Proprietary | $49/mo | Moderate | CVE search, services | Censys |
| Passive DNS | CIRCL + VirusTotal | Mixed | Free | Moderate | Domain history | DNSDB (paid) |
| Certificate transparency | crt.sh | Open access | Free | Low | Subdomain enumeration | Censys certs |
| BGP / ASN | BGP.tools | Open access | Free | Low | Routing analysis | RIPE RIS, BGPView |
| Sanctions screening | OpenSanctions | Open | Free | Low | Daily-updated aggregation | Direct OFAC SDN |
| Corporate registry | OpenCorporates | Open data | Free | Low | Global officer search | Per-jurisdiction registries |
| Offshore | ICIJ Offshore Leaks | Open | Free | Low | Panama/Pandora pivot | OCCRP Aleph |
| Investigative docs | OCCRP Aleph | Open (registration) | Free | Low | Cross-source entity search | National FOIA |
| US public filings | SEC EDGAR (EFTS) | Open | Free | Low | Full-text 10-K/8-K | |
| US court records | CourtListener + PACER | Mixed | Free / per-page | Low | Federal dockets | Trellis (state) |
| Free satellite | Sentinel Hub EO Browser | Open data | Free | Low | 10 m, 5-day revisit | Google Earth Pro |
| Commercial satellite | Planet Labs | Proprietary | Paid | Moderate | Daily 3 m | Maxar Open Data (event) |
| Solar position | SunCalc.net | Open | Free | Low | Shadow-angle geo | |
| Video verification | InVID/WeVerify | Open (EU consortium) | Free | Moderate | Keyframe + multi-engine RIS | |
| Reverse image | Yandex + Google + Bing | Proprietary | Free | Moderate | Multi-engine merge | TinEye |
| Image forensics | FotoForensics | Closed (web) | Free | Server-upload | ELA | Self-hosted tools |
| Metadata | ExifTool | Open | Free | Local | All formats | |
| Video archive | yt-dlp | Open | Free | Local + server | 1,000+ sites | |
| Translation (EU) | DeepL Pro | Proprietary | $9.99/mo | Server-side | DE/FR/ES/RU quality | Google Translate |
| Translation (other) | Google Translate | Proprietary | Free | Server-side | AR/FA/ZH/JA/KO gist | Claude/Gemini |
| OCR | Tesseract | Open | Free | Local | 100+ languages | Cloud Vision APIs |
| PDF tables | Camelot / Tabula | Open | Free | Local | Structured extraction | |
| Reference mgmt | Zotero | Open | Free | Local + opt sync | PDF annotation, citations | |
| Knowledge base | Obsidian | Closed source (free) | Free | Local-first | Vault architecture | Joplin |
| LLM (hosted) | Claude API (ZDR) | Proprietary | $30–100/mo | Strong (ZDR) | Drafting, review, extraction | Gemini Vertex |
| LLM (local) | Ollama + Llama-3-70B | Open weights | Hardware | Strongest | Sensitive bulk processing | Mistral Large |
| Encrypted messaging | Signal | Open | Free | Strong | Source comms | |
| Email + VPN | Proton Suite | Mixed | Freemium | Strong | E2E mail, VPN | Tutanota + Mullvad |
| Password manager | KeePassXC | Open | Free | Strongest | Local-only credentials | 1Password (cloud) |
| Hardware 2FA | YubiKey | Proprietary | $25–55 once | Strongest | SIM-swap mitigation | |
| Personal mesh VPN | Tailscale | Closed source (free tier) | Free | Strong | Self-host access | Headscale (FOSS) |
| Hardened browser | Librewolf | Open | Free | Strong | Collection identity | Firefox + arkenfox |
| Compartmentalised OS | Qubes OS | Open | Free | Strongest | VM isolation | |

The matrix is a starting point, not a prescription. Where your work has unusual requirements — extreme sensitivity, specific jurisdictional coverage, a particular collection subject — substitute accordingly. The point of the matrix is that each substitution should be a deliberate selection against the seven criteria, not a default.


4. What Not to Build / What Not to Buy

Independent analyst stacks fail in characteristic ways. The four most common failures are catalogued below.

4.1 Overbuilding infrastructure

The most common failure is the analyst who spends more time maintaining their stack than using it. Self-hosting is seductive: every commercial tool has an open-source equivalent that you can run yourself, and at first glance each self-host saves a subscription fee. In practice, the maintenance cost of self-hosted tools on single-person infrastructure is real, and it is paid in the currency that matters most — analytical hours.

The diagnostic question: how much time per week do you spend on infrastructure work (updates, debugging, integration, backups, monitoring)? If the answer is more than a few hours, your stack is over-built relative to your analytical throughput. The corrective is to retire self-hosted tools whose cloud equivalent is cheap, reliable, and OPSEC-acceptable for the work in question.

The exception is infrastructure whose existence is non-negotiable for OPSEC reasons — local AI inference for sensitive collection, a private knowledge base, an air-gapped credential store. For these, self-hosting is the architecture; the maintenance cost is a fixed cost of the threat model.

4.2 Enterprise tools without enterprise budgets

Palantir Gotham. Babel Street. Refinitiv/LSEG World-Check. LexisNexis Risk Solutions. Maltego enterprise. Recorded Future. ZeroFox. The list of tools that cost $30,000–$250,000 per seat per year is long, and the temptation to find a way to access them — through trial accounts, through institutional partnerships, through grey-market arrangements — is recurrent.

The temptation should be resisted on two grounds. First, the price point makes sustained access impossible for an independent practitioner; the work becomes dependent on a tool you cannot keep, which is a structural fragility. Second, and more importantly, these tools are not the right tools for independent practice. They are architected for institutional teams with workspace-sharing, data-governance, and downstream integration with classified or proprietary feeds. The independent practitioner who somehow accessed them would discover that their analytical workflow does not align with the tool’s assumptions and that the marginal capability over the free-tier and open-source stack is smaller than the marketing implies.

The successful independent practice is built on the free-tier and low-cost commercial stack documented above. It is not a constrained substitute for the institutional stack; it is a different architecture that fits a different operating model.

4.3 Scraping against platform Terms of Service

Building custom scrapers for platforms with Terms of Service restrictions creates legal exposure that the independent practitioner is poorly positioned to absorb. LinkedIn, Facebook, Instagram, X/Twitter, and TikTok all have explicit ToS restrictions on automated collection, and several have pursued litigation against scraping operations. The legal landscape is unsettled (see hiQ v. LinkedIn and its aftermath, where the legal status of scraping public profiles oscillated through multiple appellate decisions), but “the law is unsettled” is not a safe operational posture for a solo practitioner without a litigation budget.

The operational alternatives:

  • API-based access where the platform offers it, even on restrictive terms. The trade-off is rate limits and content gaps; the benefit is legal cover.
  • Research partnerships with institutional researchers who have data-access agreements (academic institutions, journalism collectives). The trade-off is collaboration overhead; the benefit is access through a legitimate channel.
  • Manual collection at reasonable cadence for high-value targets, with documented research purpose. Manual collection at human-scale rate is not what platform legal teams pursue.
  • Use of platform-archive services (Bellingcat’s tooling, Wayback Machine, archive.today) which have established themselves under fair-use and research-purpose frameworks.

The risk-acceptance posture is: do not build infrastructure whose ongoing operation creates legal exposure you cannot absorb if the platform escalates.

4.4 Trusting free tools with sensitive data

Free tiers of commercial tools are not gifts. The vendor’s economic model has to pay for the infrastructure, and free-tier user activity is one of the inputs to that model — through advertising, through aggregate data sales, through training-data collection, or through the funnel into paid conversion.

For low-sensitivity collection, this is acceptable. For sensitive collection, it is not. The discipline is to classify each tool by data handling:

  • Local processing: data does not leave your machine (ExifTool, Tesseract, Camelot, KeePassXC, Ollama, local Joplin/Obsidian).
  • Server-side with strong data handling: vendor has clear, contractually-binding data-handling commitments (Proton, Signal, Claude with ZDR enabled, paid Tailscale).
  • Server-side with permissive data handling: vendor’s free-tier terms allow them to use your activity broadly (Google Translate, free Inoreader/Feedly, free reverse-image services, many other freemium tools).

Match the tool’s data-handling tier to the sensitivity of the data you put through it. There is nothing wrong with running Google Translate on a public press release. There is everything wrong with running it on a confidential source’s identifying details.
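The matching rule can be expressed as a small gate that refuses any tool whose handling tier is below what the data requires. This is a sketch under stated assumptions: the two-level sensitivity scale and the minimum-tier policy are illustrative, and the tool classifications are taken from the list above. Unclassified tools are rejected by default.

```python
from enum import IntEnum


class Handling(IntEnum):
    PERMISSIVE_SERVER = 1  # free-tier terms allow broad use of your activity
    STRONG_SERVER = 2      # contractually binding data-handling commitments
    LOCAL = 3              # data does not leave your machine


class Sensitivity(IntEnum):
    LOW = 1
    SENSITIVE = 2


# Assumed policy: minimum handling tier accepted for each sensitivity level.
MINIMUM = {
    Sensitivity.LOW: Handling.PERMISSIVE_SERVER,
    Sensitivity.SENSITIVE: Handling.STRONG_SERVER,
}

# Classifications taken from the list above.
TOOL_TIER = {
    "ExifTool": Handling.LOCAL,
    "Tesseract": Handling.LOCAL,
    "Signal": Handling.STRONG_SERVER,
    "Google Translate": Handling.PERMISSIVE_SERVER,
}


def allowed(tool: str, sensitivity: Sensitivity) -> bool:
    """True only if the tool is classified and meets the minimum tier."""
    tier = TOOL_TIER.get(tool)
    return tier is not None and tier >= MINIMUM[sensitivity]


print(allowed("Google Translate", Sensitivity.LOW))        # public press release
print(allowed("Google Translate", Sensitivity.SENSITIVE))  # source-identifying detail
```

Rejecting unknown tools forces the classification step to happen before a tool touches sensitive material.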


5. Maintenance and Version Control for Your Stack

A stack is a working artefact, and like any artefact it accumulates entropy. Maintenance discipline keeps the stack useful instead of letting it decay into a collection of half-configured, half-broken, half-remembered tools.

5.1 Document your stack

Maintain a private reference note (in your vault, not in the public layer) that records for each tool: name, version (where applicable), installation method, configuration paths, account / API credentials (referenced by name to your password manager, never inline), licensing tier and renewal date, and the analytical purpose it serves. The discipline is not optional. The cost of forgetting which version you were running and how it was configured when it last worked, six months ago, is paid the next time it breaks.

Suggested location: 09 Repository/Operations/Stack-Inventory.md (private, never published).
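A structured record keeps inventory entries uniform. The sketch below models the fields listed in 5.1 and renders one entry as a section for the inventory note; the class name, field names, and the example values (path, renewal date, credential entry name) are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class StackEntry:
    name: str
    purpose: str
    version: str = "n/a"
    install_method: str = ""
    config_paths: list[str] = field(default_factory=list)
    credential_ref: str = ""  # name of the password-manager entry, never the secret
    licence_tier: str = "free"
    renewal: Optional[date] = None

    def to_markdown(self) -> str:
        """Render the entry as a heading plus a bullet list."""
        renewal = self.renewal.isoformat() if self.renewal else "none"
        paths = ", ".join(self.config_paths) or "none"
        return "\n".join([
            f"## {self.name}",
            f"- Purpose: {self.purpose}",
            f"- Version: {self.version}",
            f"- Installed via: {self.install_method or 'unrecorded'}",
            f"- Config paths: {paths}",
            f"- Credentials: {self.credential_ref or 'none'}",
            f"- Licence tier: {self.licence_tier}",
            f"- Renewal: {renewal}",
        ])


# Illustrative values only; the path, date and entry name are hypothetical.
entry = StackEntry(
    name="Inoreader Pro",
    purpose="Feed aggregation with rule-based routing",
    install_method="web service",
    config_paths=["exports/inoreader-rules.opml"],
    credential_ref="password manager entry 'Inoreader'",
    licence_tier="paid",
    renewal=date(2026, 1, 15),
)
print(entry.to_markdown())
```

The `credential_ref` field holds only a pointer into the password manager, which keeps the inventory note safe to back up alongside the rest of the vault.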

5.2 Track breaking changes

For tools you depend on, subscribe to release notifications:

  • GitHub releases (use the “watch → custom → releases” feature for repositories of tools you rely on).
  • Vendor announcement channels — change blogs, status pages, email release notes.
  • Community channels for major tools — the Obsidian forum, OSINT-tool community Discord servers, relevant subreddits.

The point is not to read every release note but to be alerted that a release happened, so that you do not first discover a breaking change when it breaks your workflow mid-investigation.
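As a complement to the watch setting, GitHub also publishes an Atom feed of releases for each public repository at `/releases.atom`, which can be routed into the same feed reader as the rest of your collection. The sketch below only builds the feed URLs; the repository list is an example and should be replaced with the tools in your own inventory.

```python
def releases_feed(repo: str) -> str:
    """Build the releases Atom feed URL for an 'owner/name' GitHub repository."""
    owner, sep, name = repo.partition("/")
    if not sep or not owner or not name or "/" in name:
        raise ValueError(f"expected 'owner/name', got {repo!r}")
    return f"https://github.com/{owner}/{name}/releases.atom"


# Example repositories for tools mentioned in this chapter.
REPOS = ["yt-dlp/yt-dlp", "keepassxreboot/keepassxc", "restic/restic"]

for repo in REPOS:
    print(releases_feed(repo))
```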

5.3 Backup configurations

Configurations are part of the stack. Store them in a private, encrypted backup:

  • Browser profiles (with credential stores extracted into the password manager, not in browser storage)
  • Obsidian vault configuration (the .obsidian/ directory: workspace, plugins, themes)
  • Zotero library and settings
  • Tool-specific configurations — ExifTool config files, custom Inoreader rules, n8n workflows, Jupyter kernel configs
  • Shell configurations — .bashrc, .zshrc, custom scripts, aliases
  • Editor configurations — vim/neovim/VSCode settings

A modest discipline like a daily encrypted snapshot to a self-hosted backup target (Restic to a Tailscale-connected NAS, for example) protects against the bad outcomes — laptop loss, drive failure, configuration corruption — that otherwise wipe out months of stack tuning.
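The snapshot step can be sketched as a script that assembles a `restic backup` command from whichever configuration paths actually exist on the machine. The candidate paths and the repository location below are hypothetical, and the sketch deliberately builds the argument list without running it, so that it can be inspected before being handed to a scheduler.

```python
from pathlib import Path


def build_restic_backup(repo: str, candidates: list[str]) -> list[str]:
    """Return the argv for `restic -r REPO backup PATH...`, skipping missing paths."""
    existing = [
        str(Path(p).expanduser())
        for p in candidates
        if Path(p).expanduser().exists()
    ]
    if not existing:
        raise ValueError("none of the candidate paths exist; nothing to back up")
    return ["restic", "-r", repo, "backup", *existing]


# Hypothetical configuration paths; adjust to your own layout.
CANDIDATES = [
    "~/.bashrc",
    "~/.zshrc",
    "~/Vault/.obsidian",
    ".",  # included so the example always finds at least one path
]

# Hypothetical repository on a NAS reachable over the private mesh.
argv = build_restic_backup("sftp:backup@nas:/srv/restic/stack", CANDIDATES)
print(" ".join(argv))
```

To execute the command, pass `argv` to `subprocess.run(argv, check=True)` from a scheduled job, with the repository password supplied through one of restic's documented mechanisms and never hard-coded in the script.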

5.4 Evaluate the stack annually

Once a year, audit the stack:

  • Remove tools you no longer use. A tool unused for six months has no analytical claim on your maintenance time.
  • Identify capability gaps. What kind of work did you turn down or do badly in the last year because you lacked the right tool? Address it.
  • Identify maintenance over-spend. Which tools cost the most maintenance relative to analytical return? Replace or retire.
  • Re-check OPSEC posture. Vendors change data-handling policies. The tool you adopted on strong terms two years ago may be operating under weaker terms today.
  • Re-check cost. Subscription prices drift upward. Each paid tool should still pass the “would I pay this if it were new?” test.

The annual audit is the discipline that prevents a stack from becoming what every long-running analyst’s stack tends to become: strata of tools accumulated over time, half of which are no longer optimal for current work and none of which are actively maintained.
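The first audit check, removing tools unused for six months, reduces to date arithmetic if the inventory records a last-used date. A minimal sketch, assuming such a field exists; the tool names and dates are hypothetical, and six months is approximated as 183 days.

```python
from datetime import date, timedelta


def stale_tools(last_used: dict[str, date], today: date, max_idle_days: int = 183) -> list[str]:
    """Return the names of tools not used within the idle window, sorted."""
    cutoff = today - timedelta(days=max_idle_days)
    return sorted(name for name, used in last_used.items() if used < cutoff)


# Hypothetical inventory data.
LAST_USED = {
    "tool_a": date(2024, 3, 1),
    "tool_b": date(2024, 12, 1),
}

print(stale_tools(LAST_USED, today=date(2025, 1, 1)))
```

The output is a list of candidates for review, not for automatic removal: a tool kept for a rare but critical task can justify its place despite long idle periods.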


6. Closing Reflection — The Stack as Analytical Asset

The tool stack is not the analyst. The judgement is the analyst.

This chapter has catalogued tools because the question “what should I use” is asked frequently and deserves a serious answer. But the answer should be held in proportion. An analyst with weak judgement and the best stack in the world produces poor analysis with elaborate infrastructure. An analyst with strong judgement and a modest stack produces strong analysis. The tools are amplifiers; what they amplify is whatever capability is already there.

The independent practitioner’s competitive position is built primarily on judgement, subject-matter depth, source-cultivation, and analytical discipline. The stack is in service of those. A practitioner who chooses tools with the principles in Section 1, organises them by function as in Section 2, selects deliberately against the criteria in Section 3, avoids the failures in Section 4, and maintains them as in Section 5 — that practitioner has done the necessary work and can return to the analytical work itself.

The stack is finite, knowable, and instrumental. The analysis is unbounded, demanding, and the point of the practice.

