Python OSINT Automation Guide

BLUF

Python is the dominant scripting language for OSINT automation (Fact). Its position is the result of three converging factors: an unusually deep library ecosystem covering virtually every collection vector an analyst encounters (HTTP, HTML/JS, APIs, NLP, graph theory, geospatial, image metadata); rapid prototyping syntax that lets an analyst go from idea to working script in minutes rather than hours; and a community that has standardized on Python for nearly every public OSINT framework worth installing (theHarvester, Recon-ng, Spiderfoot, Maltego’s TRX SDK, Sherlock, Holehe, Maigret: all Python).

The core use cases for a practicing intelligence analyst are bulk collection (sweeping hundreds of usernames, domains, or IPs against APIs that would be impractical to query manually), enrichment pipelines (chaining WHOIS → DNS → Shodan → VirusTotal lookups against a target list), de-duplication and triage (collapsing thousands of news items into a manageable signal set), automated alerting (cron + script + Telegram bot for threshold-based notifications), and network visualization (building entity-link registers programmatically and exporting to Gephi or Neo4j).

Python OSINT automation is not a replacement for analyst judgment (Assessment). It is a force multiplier that handles the mechanical layer of collection so the analyst can spend cognitive bandwidth on verification, hypothesis testing, and synthesis. The analyst who automates collection without rigorous verification produces higher-volume garbage faster; the analyst who automates collection with verification discipline produces an order of magnitude more high-quality output than a manual workflow.

This guide assumes a working Python 3.11+ installation on a Linux host (Parrot / Debian / Ubuntu typical for the PIA stack), familiarity with pip and virtual environments, and at least intermediate Python literacy. It is not a Python tutorial — it is an operational reference for analysts already capable of reading and writing Python code.


Core Libraries — The OSINT Python Stack

The libraries below constitute the working stack for the vast majority of OSINT automation work. Install the ones you need in a dedicated virtual environment (python3 -m venv osint-env && source osint-env/bin/activate).

Library | Purpose | Install
--- | --- | ---
requests / httpx | HTTP requests, API calls, web fetching; httpx adds HTTP/2 + async | pip install requests httpx
BeautifulSoup4 / lxml | HTML/XML parsing and scraping; lxml is the fast parser backend | pip install beautifulsoup4 lxml
Selenium / Playwright | Headless browser automation for JS-heavy sites; Playwright is the modern choice | pip install playwright && playwright install
Scrapy | Full-featured scraping framework for large-scale collection | pip install scrapy
pandas | Data cleaning, deduplication, pivot analysis | pip install pandas
networkx | Graph analysis for relationship mapping | pip install networkx
matplotlib / plotly | Visualization | pip install matplotlib plotly
shodan | Official Shodan Python SDK | pip install shodan
tweepy | Twitter/X API client | pip install tweepy
spacy / nltk | NLP for entity extraction from text | pip install spacy nltk && python -m spacy download en_core_web_sm
geopy | Geocoding and geospatial calculations | pip install geopy
tqdm | Progress bars for long-running jobs | pip install tqdm
python-dotenv | Credential management via .env files | pip install python-dotenv
celery + redis | Task queuing for large async pipelines | pip install celery redis
telethon | Telegram channel monitoring via MTProto | pip install telethon
python-whois | WHOIS lookups | pip install python-whois
dnspython | DNS resolution and zone querying | pip install dnspython
exifread / Pillow | Image metadata extraction | pip install exifread Pillow

Assessment: The minimal viable OSINT stack is requests + BeautifulSoup + pandas + python-dotenv. Everything else is added as specific collection requirements emerge. Resist the temptation to install the entire stack pre-emptively — dependency bloat creates maintenance pain and slows venv rebuilds.


Credential Management and OPSEC

The single most common operational failure in analyst-written OSINT scripts is hardcoded API keys (Assessment) — typically discovered when a script is committed to a public repo or shared informally with a colleague. Treat API keys as you would any other operational credential.

Standing rules:

  • Never hardcode API keys. Use python-dotenv to load credentials from a .env file kept outside version control, or use environment variables set in your shell profile.
  • One .env per project / investigation. This enables key rotation per investigation and limits blast radius if a key leaks.
  • Add .env to .gitignore before the first commit. A committed secret is effectively permanent: deleting the file later still leaves it in git history, so treat any key that reached a commit as compromised and rotate it. Use git-secrets or trufflehog as pre-commit hooks for additional defense in depth.
  • Rotate keys regularly. Shodan, VirusTotal, HIBP, and most other providers allow key regeneration from the dashboard. Quarterly rotation is reasonable for active analysts.
  • Rate-limit aggressively. Most free-tier API bans result from naive scripts exceeding limits in seconds. Implement per-request sleep or use a rate-limiting library like ratelimit.
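  • Proxy rotation for scraping. For sustained scraping against the same target, use rotating residential proxies (Bright Data, Smartproxy). For .onion services, install the SOCKS extra (pip install "requests[socks]") and route requests through a local Tor SOCKS proxy (socks5h://127.0.0.1:9050, so that hostnames resolve inside Tor), or use a wrapper such as requests-tor. For collection methodology across dark web and .onion environments, see DARKINT.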
  • User-agent spoofing. A bare python-requests/2.31.0 UA is the fastest path to a 403. Rotate among realistic browser UAs (a small list of 5–10 current Chrome/Firefox strings is sufficient).
  • Separate API keys per investigation. If keys leak or get burned, only one investigation’s collection trail is compromised.
  • Log which key is used for what. A simple audit.log recording timestamp + target + key alias supports both troubleshooting and post-hoc OPSEC review.

# .env (kept outside git)
SHODAN_API_KEY=xxxxxxxxxxxxxxxxxxxxxxxx
VT_API_KEY=yyyyyyyyyyyyyyyyyyyyyyyy
HIBP_API_KEY=zzzzzzzzzzzzzzzzzzzzzzzz
 
# loader pattern (Python script)
import os
from dotenv import load_dotenv

load_dotenv()  # finds the .env file and loads it into the process environment
shodan_key = os.getenv("SHODAN_API_KEY")  # None if the variable is not set

For PIA-scale operations, store all API keys as secure notes in Proton Pass under the pia/api-keys tag; .env files on disk are working copies, not the source of truth.
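
The standing rules on loading credentials and logging key usage can be sketched with the standard library alone. This is a minimal illustration, not a prescribed implementation: the helper names (require_key, make_audit_logger, audit) and the log line format are assumptions made for this example, and the variables are expected to be in the environment already (for instance after load_dotenv()).

```python
import logging
import os
import time


def require_key(name: str) -> str:
    """Return a credential from the environment, failing loudly if it is absent."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Missing credential {name}; check the project's .env file")
    return value


def make_audit_logger(path: str = "audit.log") -> logging.Logger:
    """Build a file logger that stamps every entry with a UTC timestamp."""
    logger = logging.getLogger(f"osint.audit.{path}")
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid attaching duplicate handlers on repeated calls
        handler = logging.FileHandler(path, encoding="utf-8")
        formatter = logging.Formatter("%(asctime)sZ %(message)s", "%Y-%m-%dT%H:%M:%S")
        formatter.converter = time.gmtime  # log in UTC rather than local time
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger


def audit(logger: logging.Logger, target: str, key_alias: str) -> None:
    """Record which key alias touched which target; the key value is never logged."""
    logger.info("target=%s key_alias=%s", target, key_alias)
```

Failing at startup when a key is missing is a deliberate choice here: a script that silently runs with a None key tends to burn rate limit on 401 responses before anyone notices.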


Web Scraping Patterns

Static HTML

The baseline pattern for any HTML target that doesn’t require JavaScript execution:

import requests
from bs4 import BeautifulSoup
 
def fetch_and_parse(url: str) -> BeautifulSoup:
    headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) ..."}
    r = requests.get(url, headers=headers, timeout=15)
    r.raise_for_status()
    return BeautifulSoup(r.text, "lxml")

This handles 70%+ of public-web targets (Assessment). It will fail against any site that lazy-loads content via JavaScript or implements anti-bot measures (Cloudflare, DataDome, PerimeterX) — for those, escalate to Playwright.

JavaScript-heavy sites (Playwright)

Playwright is the current best-in-class headless browser automation library for Python. It is the recommended replacement for Selenium in 2026 — better async support, simpler API, more reliable selectors.

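from playwright.sync_api import sync_playwright
 
url = "https://example.com"  # placeholder target

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto(url, timeout=30_000)  # timeout is in milliseconds
    page.wait_for_selector(".target-element")  # placeholder selector
    content = page.content()  # HTML after JavaScript has run
    browser.close()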

For sustained scraping with Playwright, use the async API (async_playwright) and a connection pool. For stealth against anti-bot systems, layer playwright-stealth on top.

Scrapy spider pattern

Scrapy is the right tool when you need to crawl a site at scale — thousands of pages, with pagination, deduplication, and structured output. The mental model is different from requests + BeautifulSoup: you define a Spider class with start_urls and parse() callbacks, and Scrapy handles concurrency, retries, and pipelines.

import scrapy
 
class TargetSpider(scrapy.Spider):
    name = "target"
    start_urls = ["https://example.com/list"]
 
    def parse(self, response):
        for item in response.css(".item"):
            yield {
                "title": item.css("h2::text").get(),
                "link":  item.css("a::attr(href)").get(),
            }
        next_page = response.css(".next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, self.parse)

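Inside a Scrapy project, run with scrapy crawl target -o output.jsonl; for a standalone spider file, use scrapy runspider spider.py -o output.jsonl instead. For investigation-scale work (under 1,000 pages), requests + BeautifulSoup is usually faster to write; Scrapy pays off above that threshold.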

Rate limiting

The discipline that distinguishes a professional scraper from a script-kiddie crawler:

  • Exponential backoff on 429 (Too Many Requests) and 503 responses — wait 2^n seconds before retry, max 5 retries.
  • Respect robots.txt unless you have explicit justification to ignore it. Use urllib.robotparser to parse it programmatically.
  • Implement a global delay between requests — time.sleep(random.uniform(2, 5)) between page fetches is a reasonable starting point for most public-web targets.
  • Concurrent connections cap. Even when async, cap to 4–8 concurrent connections per target host. Higher rates trigger anti-bot.
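
The backoff and robots.txt rules above can be sketched as two small helpers. This is an illustration under stated assumptions: fetch is any callable that returns an object with a status_code attribute (for example lambda u: requests.get(u, timeout=15)), and the function names and the added jitter are choices made for this example rather than part of the original rules.

```python
import random
import time
from urllib import robotparser

RETRY_STATUSES = {429, 503}


def fetch_with_backoff(fetch, url, max_retries=5, sleep=time.sleep):
    """Call fetch(url); on 429/503 wait 2**n seconds (plus jitter) and retry."""
    for attempt in range(max_retries + 1):
        response = fetch(url)
        if response.status_code not in RETRY_STATUSES:
            return response
        if attempt < max_retries:
            # n = 0, 1, 2, ... gives waits of roughly 1s, 2s, 4s, ...
            sleep(2 ** attempt + random.uniform(0, 1))
    raise RuntimeError(f"Gave up on {url} after {max_retries} retries")


def allowed_by_robots(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a URL against an already-downloaded robots.txt body."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

Passing sleep as a parameter keeps the retry logic testable without real delays; in production the default time.sleep is used.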

API Integration Patterns

Shodan integration

Shodan is the canonical infrastructure-OSINT API. The Python SDK is thin and well-documented.

import shodan, os
api = shodan.Shodan(os.getenv("SHODAN_API_KEY"))
results = api.search("apache country:BR")
for r in results["matches"]:
    print(r["ip_str"], r["port"], r.get("org", ""))

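For sustained queries, use api.search_cursor(), which pages through the full result set, rather than search(), which returns a single page of up to 100 results per call. Paging deep into a result set consumes query credits. See Shodan-Censys Guide for query syntax depth.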

HaveIBeenPwned / data breach lookup

HIBP v3 API requires an API key (~$3.50/month for hobbyist tier). Key constraints:

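  • Rate limit: tier-dependent. The figure of 1 request per 1.5 seconds for the email lookup endpoint comes from the original single-tier key; current subscriptions set a per-minute limit, and a 429 response includes a retry-after header stating how long to wait.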
  • Headers required: hibp-api-key + a descriptive User-Agent (HIBP rejects generic UAs).
  • Endpoint: https://haveibeenpwned.com/api/v3/breachedaccount/{email}.

import time
from urllib.parse import quote

import requests

def hibp_check(email: str, api_key: str) -> list:
    # URL-encode the account so characters such as "+" survive in the path
    url = f"https://haveibeenpwned.com/api/v3/breachedaccount/{quote(email, safe='')}"
    headers = {"hibp-api-key": api_key, "User-Agent": "PIA-OSINT-Script"}
    r = requests.get(url, headers=headers, params={"truncateResponse": "false"}, timeout=15)
    time.sleep(1.6)  # pause after every call; tune to your plan's rate limit
    if r.status_code == 404:
        return []  # no breaches
    r.raise_for_status()
    return r.json()

WHOIS / DNS lookups

import whois
import dns.resolver
 
w = whois.whois("example.com")
print(w.registrar, w.creation_date, w.emails)
 
answers = dns.resolver.resolve("example.com", "MX")
for r in answers:
    print(r.exchange, r.preference)

Gap: python-whois returns inconsistent field names across TLDs — .com returns different keys than .ru or .cn. Build a normalization layer if you’re processing mixed-TLD batches.
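
A normalization layer for the gap above might look like the following sketch. It assumes the WHOIS result can be read like a dict and that fields such as creation_date and emails may arrive as a scalar, a list, or None; the output keys (registrar, created, emails) and helper names are hypothetical choices for this example, and real batches will likely need more fields.

```python
from datetime import datetime


def _first(value):
    """Collapse a list or tuple to its first element; pass scalars through."""
    if isinstance(value, (list, tuple)):
        return value[0] if value else None
    return value


def _as_list(value):
    """Return a list whether the field held None, one string, or several."""
    if value is None:
        return []
    if isinstance(value, str):
        return [value]
    return list(value)


def normalize_whois(record) -> dict:
    """Map an inconsistent WHOIS record onto a fixed, comparable shape."""
    created = _first(record.get("creation_date"))
    if isinstance(created, datetime):
        created = created.isoformat()
    return {
        "registrar": _first(record.get("registrar")),
        "created": created,
        # lower-case and de-duplicate so the same address compares equal across TLDs
        "emails": sorted({e.lower() for e in _as_list(record.get("emails"))}),
    }
```

With a fixed output shape, rows from different TLDs can go into the same table or DataFrame without per-registry special cases downstream.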

IP geolocation

  • ip-api.com — free, 45 req/min limit, no key required. Good for ad-hoc lookups.
  • MaxMind GeoIP2 — commercial; local database file (no API calls); the standard for bulk processing.
  • ipinfo.io — freemium; cleaner API than ip-api; 50k req/mo free tier.
# MaxMind GeoIP2 — local DB
import geoip2.database
reader = geoip2.database.Reader("/path/to/GeoLite2-City.mmdb")
response = reader.city("8.8.8.8")
print(response.country.iso_code, response.city.name)

Telegram scraping via Telethon

Telethon enables programmatic access to Telegram via the MTProto protocol — far more capable than the Bot API. Use cases include conflict-zone channel monitoring (essential for Ukraine/Gaza/Sudan OSINT), IO campaign tracking, threat-actor channel surveillance, and forwarded-message graph construction.

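import asyncio, os
from telethon import TelegramClient
 
async def main():
    # TG_API_ID / TG_API_HASH are placeholder variable names; values come from my.telegram.org
    api_id, api_hash = int(os.environ["TG_API_ID"]), os.environ["TG_API_HASH"]
    async with TelegramClient("session_name", api_id, api_hash) as client:
        async for msg in client.iter_messages("channel_username", limit=100):
            print(msg.date, msg.sender_id, msg.text)

asyncio.run(main())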

OPSEC imperative: Use a dedicated Telegram account for OSINT collection, never your personal account. Register with a burner phone number (MySudo, Hushed, or a virtual SIM service). Telegram ties every session to a phone number, and an account can be deanonymized by adversaries who control the channels you monitor.


Data Enrichment Pipeline Example

A canonical enrichment pipeline for a domain-target investigation:

  1. Input: list of target domains from CSV (perhaps from a phishing report or initial reconnaissance).
  2. Enrich (sequential lookups per domain):
    • WHOIS — registrant, dates, registrar
    • DNS records — A, MX, NS, TXT
    • Shodan exposure — open ports, banners on the A record IP
    • VirusTotal reputation — historical detections, passive DNS
  3. Store: SQLite via sqlite3 stdlib module or pandas to_sql(); SQLite is the right choice for investigations — file-based, no daemon, queryable from the analyst’s laptop.
  4. Output: CSV for analyst review + summary Markdown report for the vault.
graph LR
    A[Target List CSV] --> B[WHOIS Lookup]
    B --> C[DNS Records]
    C --> D[Shodan Exposure]
    D --> E[VT Reputation]
    E --> F[SQLite Store]
    F --> G[CSV Export]

Implementation notes:

  • Caching: cache each external API response keyed by (target, api_name, date). This makes re-runs cheap and supports incremental updates.
  • Errors are data: when a lookup fails (404, timeout, key error), log the failure in the database — don’t silently drop the row. Gaps are analytically significant.
  • Resumability: structure the pipeline so it can resume after an interrupted run. SQLite + idempotent upserts on (target, api_name) is the simplest pattern.
  • Progress visibility: wrap the outer loop in tqdm so the analyst sees progress and ETA on long batches.
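
The implementation notes above (caching, errors as data, resumability) can be sketched with sqlite3 from the standard library. The table layout, the enrich function, and the lookups mapping of api_name to callable are assumptions made for this example; in a real pipeline the callables would wrap the WHOIS, DNS, Shodan, and VirusTotal lookups.

```python
import json
import sqlite3
from datetime import date

SCHEMA = """
CREATE TABLE IF NOT EXISTS lookups (
    target   TEXT NOT NULL,
    api_name TEXT NOT NULL,
    run_date TEXT NOT NULL,
    status   TEXT NOT NULL,   -- 'ok' or 'error'
    payload  TEXT,            -- JSON result, or the error message
    PRIMARY KEY (target, api_name)
)
"""


def enrich(conn, targets, lookups, today=None):
    """Run every lookup for every target, skipping results cached for today."""
    today = today or date.today().isoformat()
    conn.execute(SCHEMA)
    for target in targets:
        for api_name, func in lookups.items():
            cached = conn.execute(
                "SELECT 1 FROM lookups WHERE target=? AND api_name=? "
                "AND run_date=? AND status='ok'",
                (target, api_name, today),
            ).fetchone()
            if cached:
                continue  # cache hit: a re-run costs no API call
            try:
                status, payload = "ok", json.dumps(func(target), default=str)
            except Exception as exc:  # errors are data: store them, do not drop the row
                status, payload = "error", f"{type(exc).__name__}: {exc}"
            # idempotent upsert keyed on (target, api_name)
            conn.execute(
                "INSERT OR REPLACE INTO lookups VALUES (?, ?, ?, ?, ?)",
                (target, api_name, today, status, payload),
            )
            conn.commit()  # commit per lookup so an interrupted run can resume
```

Because failed lookups are stored with status 'error' and only 'ok' rows count as cache hits, re-running the same command retries the failures and leaves the successes untouched.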

Entity Extraction and NLP

For processing scraped text — news articles, leaked documents, scraped social media — Python’s NLP libraries enable systematic extraction of named entities (persons, organizations, locations) that would be impractical to extract manually.

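spaCy is the recommended starting point: trained pipelines for English, Russian, Chinese, and roughly two dozen languages in total; tokenization-level support for many more (Arabic among them at the time of writing; check spacy.io/models for current coverage); reasonable accuracy out of the box; fast inference.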

import spacy
nlp = spacy.load("en_core_web_sm")
article_text = "Angela Merkel met Siemens executives in Berlin."  # placeholder input
doc = nlp(article_text)
for ent in doc.ents:
    if ent.label_ in {"PERSON", "ORG", "GPE"}:
        print(ent.text, ent.label_)

NLTK complements spaCy for text preprocessing — tokenization, stop-word removal, stemming, basic sentiment. NLTK is older and slower but has unique utilities (e.g., concordance views) that spaCy lacks.

Regex remains essential for OSINT-specific extraction patterns that NER models miss:

import re
EMAIL_RE   = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
IP_RE      = r'\b(?:[0-9]{1,3}\.){3}[0-9]{1,3}\b'
BITCOIN_RE = r'[13][a-km-zA-HJ-NP-Z1-9]{25,34}'
ETH_RE     = r'0x[a-fA-F0-9]{40}'
PHONE_RE   = r'\+?[0-9]{1,3}[\s.-]?\(?\d{1,4}\)?[\s.-]?\d{1,4}[\s.-]?\d{1,9}'

Assessment: Always combine NER model output with a regex pass. NER misses crypto addresses and structured identifiers; regex misses contextual entities (e.g., “the Wagner Group” as a single ORG rather than separate tokens). Run both and merge the results. The extracted entity set feeds into Entity Resolution Methodology workflows for cross-source identity disambiguation and deduplication.
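
A regex pass usually needs a validation step, because a pattern like IP_RE also matches impossible values such as 999.1.1.1. The sketch below validates IPv4 candidates with the standard ipaddress module and merges the result with NER output. The function names and the label scheme (EMAIL, IPV4, ETH) are assumptions for this example, and ner_entities stands in for (text, label) pairs taken from doc.ents.

```python
import ipaddress
import re

EMAIL_RE = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")
IP_RE = re.compile(r"\b(?:[0-9]{1,3}\.){3}[0-9]{1,3}\b")
ETH_RE = re.compile(r"\b0x[a-fA-F0-9]{40}\b")


def _valid_ipv4(candidate: str) -> bool:
    """Reject regex matches that are not real addresses (e.g. 999.1.1.1)."""
    try:
        ipaddress.IPv4Address(candidate)
        return True
    except ValueError:
        return False


def extract_indicators(text: str) -> dict:
    """Return de-duplicated, sorted indicators found by the regex pass."""
    return {
        "email": sorted(set(EMAIL_RE.findall(text))),
        "ipv4": sorted({ip for ip in IP_RE.findall(text) if _valid_ipv4(ip)}),
        "eth": sorted(set(ETH_RE.findall(text))),
    }


def merge_entities(ner_entities, indicators: dict) -> list:
    """Union NER (text, label) pairs with regex indicators into one sorted list."""
    merged = {(text.strip(), label) for text, label in ner_entities}
    for kind, values in indicators.items():
        merged.update((value, kind.upper()) for value in values)
    return sorted(merged)
```

Keeping both sources in one (text, label) list means the downstream entity-resolution step does not need to know which extractor produced a given item.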


Network Graph Analysis with NetworkX

When entity extraction produces a population of entities and observed relationships, networkx is the standard Python library for constructing and analyzing the resulting graph.

import networkx as nx
G = nx.Graph()
G.add_edge("Actor A", "Actor B", weight=3, source="article-2026-05-01")
G.add_edge("Actor B", "Actor C", weight=1, source="leak-2025-12")
# centrality
nx.degree_centrality(G)
nx.betweenness_centrality(G)
nx.eigenvector_centrality(G)

Analytical uses:

  • Degree centrality — who has the most direct connections? Identifies the network’s hubs.
  • Betweenness centrality — who sits on the most paths between other nodes? Identifies brokers, intermediaries, gatekeepers — analytically the highest-value targets in many CT/CN networks.
  • Eigenvector centrality — who is connected to other well-connected nodes? Identifies the network’s “elite” — actors whose influence is amplified by their associates’ influence.
  • Community detection (Louvain algorithm via python-louvain or networkx.algorithms.community) — partition the graph into densely-connected sub-communities. Each community typically corresponds to an operational cell, functional team, or social cluster.
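
The measures above can be tried on a toy graph: two tightly connected triangles joined by a single bridge. The node names are hypothetical. For reproducibility this sketch uses networkx's greedy modularity communities rather than Louvain; nx.community.louvain_communities(G, seed=...) is the closer match to the text in recent networkx releases.

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

G = nx.Graph()
G.add_edges_from([
    ("A", "B"), ("A", "C"), ("B", "C"),   # first cluster
    ("D", "E"), ("D", "F"), ("E", "F"),   # second cluster
    ("C", "D"),                           # the only link between the clusters
])


def rank(scores: dict, top: int = 3) -> list:
    """Return the top-scoring (node, score) pairs, highest first."""
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top]


hubs = rank(nx.degree_centrality(G), top=2)
brokers = rank(nx.betweenness_centrality(G), top=2)
communities = [set(c) for c in greedy_modularity_communities(G)]

print("hubs:", hubs)
print("brokers:", brokers)
print("communities:", communities)
```

In this graph the two bridge endpoints carry every path between the clusters, which is the structural signature of a broker that the betweenness bullet describes.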

Export to Gephi for visualization on large graphs (>500 nodes) where matplotlib breaks down:

nx.write_gexf(G, "network.gexf")  # opens directly in Gephi

See Network Analysis Methodology and Link Analysis for the analytical frameworks around centrality interpretation and graph-based entity relationships.


Scheduling and Automation

Automation discipline distinguishes a one-off script from an operational pipeline.

  • cron (Linux/Mac) — the default for recurring collection. crontab -e to add a job; output to a log file (>> /var/log/osint.log 2>&1).
  • APScheduler — in-process scheduler for Python; useful when collection logic is tightly coupled to other Python state (e.g., a long-running monitor that should run a sub-task every 15 minutes).
  • systemd timers — modern alternative to cron on Linux; better logging via journalctl, recommended on Parrot/Debian for new automation.
  • n8n integration — for the PIA stack, n8n on the Pi is the canonical orchestrator; Python scripts run via the Execute Command node, or expose a webhook that n8n calls. See the existing Inoreader→Claude→Obsidian n8n pipeline for reference. For LLM-augmented OSINT workflows within n8n, see LLM-OSINT-SOP-A2IC.
  • Alerting:
    • smtplib for email notification (use Proton Mail Bridge for OPSEC).
    • Telegram bot API for push notifications — fastest for analyst-on-the-go alerting; create a bot via @BotFather, then requests.post(f"https://api.telegram.org/bot{TOKEN}/sendMessage", ...).
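
A threshold alert through the Telegram Bot API can be sketched as follows. To stay dependency-free this example uses urllib from the standard library instead of the requests.post call mentioned above; the function names, the message wording, and the injectable send parameter are assumptions made for illustration.

```python
import json
import urllib.request


def build_alert_request(token: str, chat_id: str, text: str) -> urllib.request.Request:
    """Build a JSON POST to the Bot API sendMessage method."""
    url = f"https://api.telegram.org/bot{token}/sendMessage"
    body = json.dumps({"chat_id": chat_id, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )


def alert_if_over(count, threshold, token, chat_id, send=urllib.request.urlopen) -> bool:
    """Send one alert when count reaches threshold; return True if one was sent."""
    if count < threshold:
        return False
    message = f"OSINT alert: {count} new items (threshold {threshold})"
    send(build_alert_request(token, chat_id, message), timeout=10)
    return True
```

A cron or systemd timer job would call alert_if_over at the end of each collection run, with the token and chat ID loaded from the environment rather than written into the script.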

Operational Discipline

The difference between a script and an OSINT product is operational discipline. Standing rules:

  • Version-control your OSINT scripts in a private repo (self-hosted Gitea, private GitHub). Code is intelligence infrastructure.
  • Log every collection run: timestamp, target, result count, errors, key alias used. A single audit.log per investigation is the minimum.
  • Deduplication via content hashing: MD5 or SHA256 over scraped content; if the hash matches a prior run, skip storage but log the duplicate hit (duplicates are often signal — they indicate either re-publication or scraping a stale cache).
  • Chain-of-custody: for every collected artifact, record URL, retrieval timestamp (UTC, ISO-8601), HTTP status, and tool version. Apply Source Verification Framework evaluation standards to each source before downstream analysis. This is the difference between an artifact that’s usable in an analytical product and one that’s discardable. Mirror to the vault Evidence pack via the osint-evidence-pack convention.
  • Reproducibility: pin dependencies in requirements.txt with exact versions. The script that worked six months ago will not work today unless dependencies are pinned.
  • Idempotency: design every collection step to be safely re-runnable. Use upserts, not inserts; check existence before fetching.
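
The deduplication and chain-of-custody rules above can be combined into one record per artifact. This is a minimal sketch: the field names, the in-memory seen_hashes set, and the store list are assumptions for illustration (a real pipeline would persist both), and SHA-256 is used here rather than MD5.

```python
import hashlib
from datetime import datetime, timezone


def custody_record(url: str, body: bytes, http_status: int, tool_version: str) -> dict:
    """Describe one collected artifact: where, when, how, and its content hash."""
    return {
        "url": url,
        "retrieved_at": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "http_status": http_status,
        "tool_version": tool_version,
        "sha256": hashlib.sha256(body).hexdigest(),
    }


def store_if_new(record: dict, seen_hashes: set, store: list) -> bool:
    """Store the record unless its content hash was already seen.

    Returns False for a duplicate so the caller can log the hit instead of
    silently dropping it.
    """
    if record["sha256"] in seen_hashes:
        return False
    seen_hashes.add(record["sha256"])
    store.append(record)
    return True
```

Hashing the raw response bytes, before any parsing or cleaning, keeps the hash tied to what the server actually returned.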

Gap: Most legal frameworks’ explicit definitions of “collection” do not address Python OSINT automation; courts and regulators have largely failed to keep pace with the techniques described in this guide. Legal exposure varies significantly by jurisdiction (CFAA in the US, the GDPR in the EU, the LGPD in Brazil, the Computer Misuse Act in the UK). Scraping a public website may or may not be lawful depending on the site’s ToS, the jurisdiction, and the purpose; API access under a free tier may carry contractual restrictions that apply even where no statute does. See OSINT Legal Framework for the analytical/legal framework. When in doubt, consult counsel before scaling a collection operation. (Fact: in hiQ v. LinkedIn the Ninth Circuit held that scraping publicly accessible profiles likely does not violate the CFAA; the ruling binds only that circuit, and LinkedIn’s breach-of-contract claim survived and later succeeded in the district court.)


Toolkits and Frameworks Built on Python

The Python OSINT ecosystem includes mature frameworks that abstract much of the work described above. These are worth installing and reading even if your final workflow uses custom scripts — they are reference implementations of well-engineered OSINT automation.

Tool | Function | Notes
--- | --- | ---
theHarvester | Domain/email/host enumeration | Passive recon; queries dozens of search engines and APIs; first stop for a new domain investigation
Recon-ng | Modular OSINT framework | Built on Python; Metasploit-style module system; integrates with major API providers
Maltego TRX | Custom transform development | Python SDK for building custom Maltego transforms; bridges Python pipelines into the Maltego graph
Spiderfoot | Automated OSINT footprint | Web UI + API; 200+ collection modules; the closest thing to “OSINT as a service” you can self-host
Datasploit | Multi-source correlation | Active + passive recon; less maintained than alternatives but useful for username/domain pivots
OSINT-SPY | Target profiling | Username/email/IP profiling; thinner than alternatives but easy to embed in pipelines
Sherlock | Username enumeration across 400+ sites | The standard for username sweeps; see osint-username
Holehe | Email enumeration across services | Checks if an email is registered on 100+ platforms without sending reset emails
Maigret | Sherlock-like username sweep | More sites, slower, better reporting

Key Connections


Sources

  • Python Software Foundation — Python 3 Documentation (python.org/doc) — High (primary)
  • Shodan — API Documentation (shodan.readthedocs.io) — High (primary)
  • Scrapinghub — Scrapy Documentation (docs.scrapy.org) — High (primary)
  • NetworkX Developers — NetworkX Documentation (networkx.org) — High (primary)
  • Microsoft — Playwright for Python (playwright.dev/python) — High (primary)
  • Telethon Developers — Telethon Documentation (docs.telethon.dev) — High (primary)
  • Explosion AI — spaCy Documentation (spacy.io) — High (primary)
  • Singh, Justin — Hands-On Python for Cybersecurity and OSINT (Packt, 2021) — Medium (practitioner text)
  • Bazzell, Michael — Open Source Intelligence Techniques, 10th ed. — Medium (practitioner reference, includes Python sections)
  • SANS Institute — SEC487: Open-Source Intelligence Gathering and Analysis — High (course materials)
  • HaveIBeenPwned — API v3 Specification (haveibeenpwned.com/API/v3) — High (primary)
  • hiQ Labs, Inc. v. LinkedIn Corp., 9th Cir. (2019, 2022) — High (case law for US scraping legality)