Python OSINT Automation Guide
BLUF
Python is the dominant scripting language for OSINT automation (Fact). Its position is the result of three converging factors: an unusually deep library ecosystem covering virtually every collection vector an analyst encounters (HTTP, HTML/JS, APIs, NLP, graph theory, geospatial, image metadata); rapid prototyping syntax that lets an analyst go from idea to working script in minutes rather than hours; and a community that has standardized on Python for nearly every public OSINT framework worth installing (theHarvester, Recon-ng, Spiderfoot, Maltego’s TRX SDK, Sherlock, Holehe, and Maigret are all Python). The core use cases for a practicing intelligence analyst are bulk collection (sweeping hundreds of usernames, domains, or IPs against APIs that would be impractical to query manually), enrichment pipelines (chaining WHOIS → DNS → Shodan → VirusTotal lookups against a target list), de-duplication and triage (collapsing thousands of news items into a manageable signal set), automated alerting (cron + script + Telegram bot for threshold-based notifications), and network visualization (building entity-link registers programmatically and exporting to Gephi or Neo4j). Python OSINT automation is not a replacement for analyst judgment (Assessment): it is a force multiplier that handles the mechanical layer of collection so the analyst can spend cognitive bandwidth on verification, hypothesis testing, and synthesis. The analyst who automates collection without rigorous verification is producing higher-volume garbage faster; the analyst who automates collection with verification discipline produces an order of magnitude more high-quality output than a manual workflow.
This guide assumes a working Python 3.11+ installation on a Linux host (Parrot / Debian / Ubuntu typical for the PIA stack), familiarity with pip and virtual environments, and at least intermediate Python literacy. It is not a Python tutorial — it is an operational reference for analysts already capable of reading and writing Python code.
Core Libraries — The OSINT Python Stack
The libraries below constitute the working stack for the vast majority of OSINT automation work. Install the ones you need in a dedicated virtual environment (`python -m venv osint-env && source osint-env/bin/activate`).
| Library | Purpose | Install |
|---|---|---|
| requests / httpx | HTTP requests, API calls, web fetching; httpx adds HTTP/2 + async | pip install requests httpx |
| BeautifulSoup4 / lxml | HTML/XML parsing and scraping; lxml is the fast parser backend | pip install beautifulsoup4 lxml |
| Selenium / Playwright | Headless browser automation for JS-heavy sites; Playwright is the modern choice | pip install playwright && playwright install |
| Scrapy | Full-featured scraping framework for large-scale collection | pip install scrapy |
| pandas | Data cleaning, deduplication, pivot analysis | pip install pandas |
| networkx | Graph analysis for relationship mapping | pip install networkx |
| matplotlib / plotly | Visualization | pip install matplotlib plotly |
| shodan | Official Shodan Python SDK | pip install shodan |
| tweepy | Twitter/X API client | pip install tweepy |
| spacy / nltk | NLP for entity extraction from text | pip install spacy nltk && python -m spacy download en_core_web_sm |
| geopy | Geocoding and geospatial calculations | pip install geopy |
| tqdm | Progress bars for long-running jobs | pip install tqdm |
| python-dotenv | Credential management via .env files | pip install python-dotenv |
| celery + redis | Task queuing for large async pipelines | pip install celery redis |
| telethon | Telegram channel monitoring via MTProto | pip install telethon |
| python-whois | WHOIS lookups | pip install python-whois |
| dnspython | DNS resolution and zone querying | pip install dnspython |
| exifread / Pillow | Image metadata extraction | pip install exifread Pillow |
Assessment: The minimal viable OSINT stack is requests + BeautifulSoup + pandas + python-dotenv. Everything else is added as specific collection requirements emerge. Resist the temptation to install the entire stack pre-emptively — dependency bloat creates maintenance pain and slows venv rebuilds.
Credential Management and OPSEC
The single most common operational failure in analyst-written OSINT scripts is hardcoded API keys (Assessment) — typically discovered when a script is committed to a public repo or shared informally with a colleague. Treat API keys as you would any other operational credential.
Standing rules:
- Never hardcode API keys. Use `python-dotenv` to load credentials from a `.env` file kept outside version control, or use environment variables set in your shell profile.
- One `.env` per project / investigation. This enables key rotation per investigation and limits blast radius if a key leaks.
- Add `.env` to `.gitignore` before the first commit. This is irreversible if missed: even a deleted file remains in git history. Use `git-secrets` or `trufflehog` as pre-commit hooks for additional defense in depth.
- Rotate keys regularly. Shodan, VirusTotal, HIBP, and most other providers allow key regeneration from the dashboard. Quarterly rotation is reasonable for active analysts.
- Rate-limit aggressively. Most free-tier API bans result from naive scripts exceeding limits in seconds. Implement per-request sleep or use a rate-limiting library like `ratelimit`.
- Proxy rotation for scraping. For sustained scraping against the same target, use rotating residential proxies (Bright Data, Smartproxy) or, for `.onion` services, the `requests[socks]` package with Tor (`requests-tor`). For collection methodology across dark web and .onion environments, see DARKINT.
- User-agent spoofing. A bare `python-requests/2.31.0` UA is the fastest path to a 403. Rotate among realistic browser UAs (a small list of 5–10 current Chrome/Firefox strings is sufficient).
- Separate API keys per investigation. If keys leak or get burned, only one investigation’s collection trail is compromised.
- Log which key is used for what. A simple `audit.log` recording timestamp + target + key alias supports both troubleshooting and post-hoc OPSEC review.
```bash
# .env (kept outside git)
SHODAN_API_KEY=xxxxxxxxxxxxxxxxxxxxxxxx
VT_API_KEY=yyyyyyyyyyyyyyyyyyyyyyyy
HIBP_API_KEY=zzzzzzzzzzzzzzzzzzzzzzzz
```

```python
# loader pattern
from dotenv import load_dotenv
import os

load_dotenv()  # copies key=value pairs from .env into the process environment
shodan_key = os.getenv("SHODAN_API_KEY")
```

For PIA-scale operations, store all API keys as secure notes in Proton Pass under the pia/api-keys tag; .env files on disk are working copies, not the source of truth.
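The standing rules on credential loading and key-use logging could be wired together roughly as follows. This is a minimal sketch rather than a prescribed PIA implementation: `require_env`, `make_audit_logger`, and `audit` are hypothetical helper names, the log line format is an assumption, and the code expects `load_dotenv()` to have already populated the environment. It uses only the standard library.

```python
import logging
import os


def require_env(name: str) -> str:
    """Return an environment variable or fail fast with a clear message."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Missing credential: set {name} in .env or the shell")
    return value


def make_audit_logger(path: str = "audit.log") -> logging.Logger:
    """Create a logger that appends one timestamped line per call to `path`."""
    logger = logging.getLogger("osint.audit")
    if not logger.handlers:  # avoid stacking duplicate handlers on repeat calls
        handler = logging.FileHandler(path, encoding="utf-8")
        handler.setFormatter(logging.Formatter("%(asctime)s %(message)s"))
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    return logger


def audit(logger: logging.Logger, target: str, key_alias: str) -> None:
    """Record the target and the key alias, never the key value itself."""
    logger.info("target=%s key=%s", target, key_alias)
```

Failing at startup when a key is missing tends to be easier to debug than an authentication error halfway through a batch, and logging an alias such as `shodan-inv42` keeps the secret itself out of the log.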
Web Scraping Patterns
Static HTML
The baseline pattern for any HTML target that doesn’t require JavaScript execution:
```python
import requests
from bs4 import BeautifulSoup

def fetch_and_parse(url: str) -> BeautifulSoup:
    """Fetch a static page and parse it with the lxml backend."""
    headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) ..."}
    r = requests.get(url, headers=headers, timeout=15)
    r.raise_for_status()  # raise on 4xx/5xx instead of parsing an error page
    return BeautifulSoup(r.text, "lxml")
```

This handles 70%+ of public-web targets (Assessment). It will fail against any site that lazy-loads content via JavaScript or implements anti-bot measures (Cloudflare, DataDome, PerimeterX); for those, escalate to Playwright.
JavaScript-heavy sites (Playwright)
Playwright is the current best-in-class headless browser automation library for Python. It is the recommended replacement for Selenium in 2026 — better async support, simpler API, more reliable selectors.
```python
from playwright.sync_api import sync_playwright

url = "https://example.com"  # placeholder target

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto(url)
    page.wait_for_selector(".target-element")  # wait for the JS-rendered node
    content = page.content()  # rendered HTML, ready for BeautifulSoup
    browser.close()
```

For sustained scraping with Playwright, use the async API (`async_playwright`) and a connection pool. For stealth against anti-bot systems, layer `playwright-stealth` on top.
Scrapy spider pattern
Scrapy is the right tool when you need to crawl a site at scale — thousands of pages, with pagination, deduplication, and structured output. The mental model is different from requests + BeautifulSoup: you define a Spider class with start_urls and parse() callbacks, and Scrapy handles concurrency, retries, and pipelines.
```python
import scrapy

class TargetSpider(scrapy.Spider):
    name = "target"
    start_urls = ["https://example.com/list"]

    def parse(self, response):
        for item in response.css(".item"):
            yield {
                "title": item.css("h2::text").get(),
                "link": item.css("a::attr(href)").get(),
            }
        next_page = response.css(".next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, self.parse)
```

Run with `scrapy crawl target -o output.jsonl`. For investigation-scale work (under 1,000 pages), requests + BeautifulSoup is usually faster to write; Scrapy pays off above that threshold.
Rate limiting
The discipline that distinguishes a professional scraper from a script-kiddie crawler:
- Exponential backoff on 429 (Too Many Requests) and 503 responses: wait 2^n seconds before retry, max 5 retries.
- Respect `robots.txt` unless you have explicit justification to ignore it. Use `urllib.robotparser` to parse it programmatically.
- Implement a global delay between requests: `time.sleep(random.uniform(2, 5))` between page fetches is a reasonable starting point for most public-web targets.
- Concurrent connections cap. Even when async, cap to 4–8 concurrent connections per target host. Higher rates tend to trigger anti-bot defenses.
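The backoff and `robots.txt` rules above can be sketched with the standard library alone. The helper names are hypothetical, `fetch` stands in for whatever HTTP call the script makes (for example `lambda: requests.get(url, timeout=15).status_code`), and starting the exponent at n = 1 is an assumption, since the rule does not say where n begins. The random jitter is an addition of this sketch.

```python
import random
import time
import urllib.robotparser
from typing import Callable, Iterable


def backoff_delays(max_retries: int = 5, base: float = 2.0) -> list[float]:
    """Delays of base**n seconds for n = 1..max_retries (2, 4, 8, 16, 32)."""
    return [base ** n for n in range(1, max_retries + 1)]


def fetch_with_backoff(fetch: Callable[[], int], sleep=time.sleep) -> int:
    """Call fetch() again after each 429/503, up to five retries.

    `fetch` is a hypothetical zero-argument callable returning an HTTP status.
    """
    status = fetch()
    for delay in backoff_delays():
        if status not in (429, 503):
            break
        sleep(delay + random.uniform(0, 1))  # jitter de-synchronizes retries
        status = fetch()
    return status


def allowed_by_robots(robots_lines: Iterable[str], agent: str, url: str) -> bool:
    """Check a URL against robots.txt lines that were already downloaded."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(agent, url)
```

Passing `sleep` as a parameter keeps the retry logic testable without real waiting; in production the default `time.sleep` applies.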
API Integration Patterns
Shodan integration
Shodan is the canonical infrastructure-OSINT API. The Python SDK is thin and well-documented.
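```python
import os
import shodan

api = shodan.Shodan(os.getenv("SHODAN_API_KEY"))
try:
    results = api.search("apache country:BR")
    for r in results["matches"]:
        print(r["ip_str"], r["port"], r.get("org", ""))
except shodan.APIError as e:
    print(f"Shodan error: {e}")  # bad key, exhausted credits, malformed query
```

For sustained queries, use `api.search_cursor()`, which iterates through the full result set, rather than `search()`, which returns a single page of up to 100 results per call. See Shodan-Censys Guide for query syntax depth.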
HaveIBeenPwned / data breach lookup
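HIBP v3 API requires an API key (~$3.50/month for hobbyist tier; pricing is tiered, so check the HIBP site for current rates). Key constraints:
- Rate limit: 1 request per 1.5 seconds for the email lookup endpoint (paid). Limits vary by subscription tier; a 429 response carries a `retry-after` header stating how long to wait.
- Headers required: `hibp-api-key` + a descriptive `User-Agent` (HIBP returns 403 when the UA is missing).
- Endpoint: `https://haveibeenpwned.com/api/v3/breachedaccount/{email}`.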
```python
import time
from urllib.parse import quote

import requests

def hibp_check(email: str, api_key: str) -> list:
    # URL-encode the account so characters such as "+" survive in the path
    url = f"https://haveibeenpwned.com/api/v3/breachedaccount/{quote(email)}"
    headers = {"hibp-api-key": api_key, "User-Agent": "PIA-OSINT-Script"}
    r = requests.get(url, headers=headers,
                     params={"truncateResponse": "false"}, timeout=15)
    time.sleep(1.6)  # respect rate limit
    if r.status_code == 404:
        return []  # no breaches
    r.raise_for_status()
    return r.json()
```

WHOIS / DNS lookups
```python
import whois
import dns.resolver

w = whois.whois("example.com")
print(w.registrar, w.creation_date, w.emails)

answers = dns.resolver.resolve("example.com", "MX")
for r in answers:
    print(r.exchange, r.preference)
```

Gap: python-whois returns inconsistent field names across TLDs: .com returns different keys than .ru or .cn. Build a normalization layer if you’re processing mixed-TLD batches.
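One possible shape for the normalization layer mentioned in the Gap note is an alias table that maps TLD-specific keys onto a fixed schema. The sketch below is illustrative only: `FIELD_ALIASES` and the alias names other than `registrar`, `creation_date`, and `emails` are assumptions that would need extending from real responses. It accepts any mapping, so the result of `whois.whois(...)` can be passed in directly.

```python
from datetime import datetime
from typing import Any, Mapping

# Hypothetical alias table: extend it as new TLD-specific keys show up.
FIELD_ALIASES = {
    "registrar": ("registrar", "sponsoring_registrar"),
    "created": ("creation_date", "created", "registered"),
    "emails": ("emails", "email", "e-mail"),
}


def first_value(value: Any) -> Any:
    """Collapse a list to its first element; pass scalars through."""
    if isinstance(value, (list, tuple)):
        return value[0] if value else None
    return value


def normalize_whois(record: Mapping[str, Any]) -> dict[str, Any]:
    """Map a raw WHOIS mapping onto a fixed schema, None for absent fields."""
    out: dict[str, Any] = {}
    for field, aliases in FIELD_ALIASES.items():
        raw = next((record[a] for a in aliases if record.get(a)), None)
        if field == "emails":
            # Always a sorted, lowercased list so batches compare cleanly
            items = raw if isinstance(raw, (list, tuple)) else [raw] if raw else []
            out[field] = sorted({str(e).lower() for e in items})
        else:
            value = first_value(raw)
            out[field] = value.isoformat() if isinstance(value, datetime) else value
    return out
```

Keeping absent fields as `None` rather than dropping them preserves the gap as data, which matters when comparing registrations across TLDs.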
IP geolocation
- ip-api.com — free, 45 req/min limit, no key required. Good for ad-hoc lookups.
- MaxMind GeoIP2 — commercial; local database file (no API calls); the standard for bulk processing.
- ipinfo.io — freemium; cleaner API than ip-api; 50k req/mo free tier.
```python
# MaxMind GeoIP2: local DB (pip install geoip2)
import geoip2.database

reader = geoip2.database.Reader("/path/to/GeoLite2-City.mmdb")
response = reader.city("8.8.8.8")
print(response.country.iso_code, response.city.name)
reader.close()
```

Telegram scraping via Telethon
Telethon enables programmatic access to Telegram via the MTProto protocol — far more capable than the Bot API. Use cases include conflict-zone channel monitoring (essential for Ukraine/Gaza/Sudan OSINT), IO campaign tracking, threat-actor channel surveillance, and forwarded-message graph construction.
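```python
import asyncio
import os
from telethon import TelegramClient

async def main() -> None:
    # TG_API_ID / TG_API_HASH are assumed variable names; load them from .env
    api_id, api_hash = int(os.environ["TG_API_ID"]), os.environ["TG_API_HASH"]
    async with TelegramClient("session_name", api_id, api_hash) as client:
        async for msg in client.iter_messages("channel_username", limit=100):
            print(msg.date, msg.sender_id, msg.text)

asyncio.run(main())  # "async with" is only valid inside a coroutine
```

OPSEC imperative: Use a dedicated Telegram account for OSINT collection, never your personal account. Register with a burner phone number (MySudo, Hushed, or a virtual SIM service). Telegram links sessions to phone numbers, and accounts can be deanonymized by adversaries who control the channels you monitor.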
Data Enrichment Pipeline Example
A canonical enrichment pipeline for a domain-target investigation:
- Input: list of target domains from CSV (perhaps from a phishing report or initial reconnaissance).
- Enrich (sequential lookups per domain):
  - WHOIS: registrant, dates, registrar
  - DNS records: A, MX, NS, TXT
  - Shodan exposure: open ports, banners on the A record IP
  - VirusTotal reputation: historical detections, passive DNS
- Store: SQLite via the `sqlite3` stdlib module or pandas `to_sql()`; SQLite is the right choice for investigations because it is file-based, needs no daemon, and is queryable from the analyst’s laptop.
- Output: CSV for analyst review + summary Markdown report for the vault.
```mermaid
graph LR
    A[Target List CSV] --> B[WHOIS Lookup]
    B --> C[DNS Records]
    C --> D[Shodan Exposure]
    D --> E[VT Reputation]
    E --> F[SQLite Store]
    F --> G[CSV Export]
```
Implementation notes:
- Caching: cache each external API response keyed by `(target, api_name, date)`. This makes re-runs cheap and supports incremental updates.
- Errors are data: when a lookup fails (404, timeout, key error), log the failure in the database; don’t silently drop the row. Gaps are analytically significant.
- Resumability: structure the pipeline so it can resume after an interrupted run. SQLite + idempotent upserts on `(target, api_name)` is the simplest pattern.
- Progress visibility: wrap the outer loop in `tqdm` so the analyst sees progress and ETA on long batches.
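The caching, errors-as-data, and resumability notes above can be combined in one small loop. The following is a sketch under stated assumptions, not the pipeline's actual code: the `enrichment` table layout, the `enrich` function, and the idea of passing lookups as a `{api_name: callable}` dict are all illustrative choices. It relies on SQLite's `ON CONFLICT ... DO UPDATE` upsert syntax (SQLite 3.24+).

```python
import json
import sqlite3
from datetime import date
from typing import Callable

SCHEMA = """
CREATE TABLE IF NOT EXISTS enrichment (
    target   TEXT NOT NULL,
    api_name TEXT NOT NULL,
    run_date TEXT NOT NULL,
    status   TEXT NOT NULL,      -- 'ok' or 'error'
    payload  TEXT,               -- JSON result, or the error message
    PRIMARY KEY (target, api_name)
)
"""


def enrich(db: sqlite3.Connection, targets: list[str],
           lookups: dict[str, Callable[[str], dict]]) -> None:
    """Run every lookup for every target, skipping successes cached today."""
    db.execute(SCHEMA)
    today = date.today().isoformat()
    for target in targets:
        for api_name, lookup in lookups.items():
            cached = db.execute(
                "SELECT 1 FROM enrichment WHERE target=? AND api_name=? "
                "AND run_date=? AND status='ok'",
                (target, api_name, today),
            ).fetchone()
            if cached:
                continue  # cache hit: re-runs and resumed runs cost nothing
            try:
                status, payload = "ok", json.dumps(lookup(target))
            except Exception as exc:  # errors are data: store, don't drop
                status, payload = "error", f"{type(exc).__name__}: {exc}"
            db.execute(
                "INSERT INTO enrichment VALUES (?, ?, ?, ?, ?) "
                "ON CONFLICT(target, api_name) DO UPDATE SET "
                "run_date=excluded.run_date, status=excluded.status, "
                "payload=excluded.payload",
                (target, api_name, today, status, payload),
            )
            db.commit()  # commit per lookup so an interrupted run can resume
```

A call might look like `enrich(sqlite3.connect("case.db"), domains, {"whois": whois_lookup, "dns": dns_lookup})`, where the two lookup functions are placeholders for the analyst's own wrappers. Failed lookups are retried on the next run because only `status='ok'` rows count as cache hits.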
Entity Extraction and NLP
For processing scraped text — news articles, leaked documents, scraped social media — Python’s NLP libraries enable systematic extraction of named entities (persons, organizations, locations) that would be impractical to extract manually.
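spaCy is the recommended starting point: trained pipelines for English, Russian, Chinese, and roughly two dozen languages in total, with tokenization-level support for many more (check spacy.io/models for current coverage; Arabic, for example, has tokenization support but no official trained pipeline); reasonable accuracy out-of-the-box; fast inference.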
```python
import spacy

nlp = spacy.load("en_core_web_sm")
article_text = "..."  # scraped article body goes here
doc = nlp(article_text)
for ent in doc.ents:
    if ent.label_ in {"PERSON", "ORG", "GPE"}:
        print(ent.text, ent.label_)
```

NLTK complements spaCy for text preprocessing: tokenization, stop-word removal, stemming, basic sentiment. NLTK is older and slower but has unique utilities (e.g., concordance views) that spaCy lacks.
Regex remains essential for OSINT-specific extraction patterns that NER models miss:
```python
import re

EMAIL_RE = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
# Loose: also matches 999.999.999.999, so validate hits with the ipaddress module
IP_RE = r'\b(?:[0-9]{1,3}\.){3}[0-9]{1,3}\b'
# Legacy addresses only; bech32 (bc1...) addresses are not covered
BITCOIN_RE = r'\b[13][a-km-zA-HJ-NP-Z1-9]{25,34}\b'
ETH_RE = r'\b0x[a-fA-F0-9]{40}\b'
# Loose: expect false positives such as dates and numeric IDs
PHONE_RE = r'\+?[0-9]{1,3}[\s.-]?\(?\d{1,4}\)?[\s.-]?\d{1,4}[\s.-]?\d{1,9}'
```

Assessment: Always combine NER model output with a regex pass. NER misses crypto addresses and structured identifiers; regex misses contextual entities (e.g., “the Wagner Group” as ORG rather than two tokens). Run both and merge. The extracted entity set feeds into Entity Resolution Methodology workflows for cross-source identity disambiguation and deduplication.
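The "run both and merge" step might be sketched as below. The function names are hypothetical, and the NER side is represented as a plain set of `(text, label)` pairs so the example runs without a model; with spaCy that set could be built as `{(ent.text, ent.label_) for ent in doc.ents}`. IP matches are validated with the standard `ipaddress` module because the loose pattern also accepts out-of-range octets.

```python
import ipaddress
import re

EMAIL_RE = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")
IP_RE = re.compile(r"\b(?:[0-9]{1,3}\.){3}[0-9]{1,3}\b")
ETH_RE = re.compile(r"\b0x[a-fA-F0-9]{40}\b")


def valid_ip(candidate: str) -> bool:
    """Reject matches such as 999.1.1.1 that the loose pattern lets through."""
    try:
        ipaddress.IPv4Address(candidate)
        return True
    except ValueError:
        return False


def regex_entities(text: str) -> set[tuple[str, str]]:
    """Return (text, label) pairs for the structured identifiers in `text`."""
    found = {(m, "EMAIL") for m in EMAIL_RE.findall(text)}
    found |= {(m, "IP") for m in IP_RE.findall(text) if valid_ip(m)}
    found |= {(m, "ETH") for m in ETH_RE.findall(text)}
    return found


def merge_entities(ner: set[tuple[str, str]], text: str) -> list[tuple[str, str]]:
    """Union NER output with the regex pass, de-duplicated and sorted."""
    return sorted(ner | regex_entities(text))
```

Using sets of pairs makes de-duplication automatic when both passes find the same string with the same label.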
Network Graph Analysis with NetworkX
When entity extraction produces a population of entities and observed relationships, networkx is the standard Python library for constructing and analyzing the resulting graph.
```python
import networkx as nx

G = nx.Graph()
G.add_edge("Actor A", "Actor B", weight=3, source="article-2026-05-01")
G.add_edge("Actor B", "Actor C", weight=1, source="leak-2025-12")

# centrality: each call returns a {node: score} dict
degree = nx.degree_centrality(G)
betweenness = nx.betweenness_centrality(G)
eigenvector = nx.eigenvector_centrality(G)
```

Analytical uses:
- Degree centrality — who has the most direct connections? Identifies the network’s hubs.
- Betweenness centrality — who sits on the most paths between other nodes? Identifies brokers, intermediaries, gatekeepers — analytically the highest-value targets in many CT/CN networks.
- Eigenvector centrality — who is connected to other well-connected nodes? Identifies the network’s “elite” — actors whose influence is amplified by their associates’ influence.
- Community detection (Louvain algorithm via `python-louvain` or `networkx.algorithms.community`): partition the graph into densely-connected sub-communities. Each community typically corresponds to an operational cell, functional team, or social cluster.
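To make the degree centrality score concrete, the arithmetic can be reproduced without NetworkX: each node's score is its number of direct neighbors divided by n - 1, where n is the node count. The sketch below is a standard-library illustration for simple graphs with at least two nodes; the edge list and the `degree_centrality` helper are hypothetical and are not a replacement for the library call.

```python
from collections import defaultdict

# Hypothetical edge list: Actor B is linked to everyone else
edges = [("Actor A", "Actor B"), ("Actor B", "Actor C"), ("Actor B", "Actor D")]


def degree_centrality(edge_list: list[tuple[str, str]]) -> dict[str, float]:
    """Fraction of the other nodes each node is directly connected to."""
    neighbors: dict[str, set[str]] = defaultdict(set)
    for a, b in edge_list:
        neighbors[a].add(b)
        neighbors[b].add(a)
    n = len(neighbors)
    if n < 2:  # nothing to normalize against
        return {node: 0.0 for node in neighbors}
    return {node: len(adj) / (n - 1) for node, adj in neighbors.items()}


scores = degree_centrality(edges)
hub = max(scores, key=scores.get)  # the node with the most direct connections
```

Here Actor B reaches all three other nodes and scores 1.0, while each of the others reaches one of three and scores about 0.33.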
Export to Gephi for visualization on large graphs (>500 nodes) where matplotlib breaks down:
```python
nx.write_gexf(G, "network.gexf")  # opens directly in Gephi
```

See Network Analysis Methodology and Link Analysis for the analytical frameworks around centrality interpretation and graph-based entity relationships.
Scheduling and Automation
Automation discipline distinguishes a one-off script from an operational pipeline.
- cron (Linux/Mac): the default for recurring collection. `crontab -e` to add a job; output to a log file (`>> /var/log/osint.log 2>&1`).
- APScheduler: in-process scheduler for Python; useful when collection logic is tightly coupled to other Python state (e.g., a long-running monitor that should run a sub-task every 15 minutes).
- systemd timers: modern alternative to cron on Linux; better logging via `journalctl`, recommended on Parrot/Debian for new automation.
- n8n integration: for the PIA stack, n8n on the Pi is the canonical orchestrator; Python scripts run via the Execute Command node, or expose a webhook that n8n calls. See the existing Inoreader→Claude→Obsidian n8n pipeline for reference. For LLM-augmented OSINT workflows within n8n, see LLM-OSINT-SOP-A2IC.
- Alerting:
  - `smtplib` for email notification (use Proton Mail Bridge for OPSEC).
  - Telegram bot API for push notifications: fastest for analyst-on-the-go alerting; create a bot via @BotFather, then `requests.post(f"https://api.telegram.org/bot{TOKEN}/sendMessage", ...)`.
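A threshold-based Telegram alert of the kind described above could be structured as follows. This sketch swaps `requests` for the standard library's `urllib` so it has no dependencies; `build_alert` and `alert_if_over` are hypothetical names, the message wording is an assumption, and the token and chat ID are expected to come from `.env` rather than source code.

```python
import json
import urllib.request


def build_alert(token: str, chat_id: str, text: str) -> urllib.request.Request:
    """Prepare a Bot API sendMessage call without sending it."""
    body = json.dumps({"chat_id": chat_id, "text": text}).encode("utf-8")
    return urllib.request.Request(
        f"https://api.telegram.org/bot{token}/sendMessage",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def alert_if_over(count: int, threshold: int, token: str, chat_id: str,
                  send=urllib.request.urlopen) -> bool:
    """Send one notification when `count` exceeds `threshold`."""
    if count <= threshold:
        return False
    message = f"OSINT alert: {count} new hits (threshold {threshold})"
    send(build_alert(token, chat_id, message), timeout=10)
    return True
```

Separating request construction from sending lets a cron-driven script be dry-run tested by passing a stub as `send`.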
Operational Discipline
The difference between a script and an OSINT product is operational discipline. Standing rules:
- Version-control your OSINT scripts in a private repo (self-hosted Gitea, private GitHub). Code is intelligence infrastructure.
- Log every collection run: timestamp, target, result count, errors, key alias used. A single `audit.log` per investigation is the minimum.
- Deduplication via content hashing: MD5 or SHA256 over scraped content; if the hash matches a prior run, skip storage but log the duplicate hit (duplicates are often signal: they indicate either re-publication or scraping a stale cache).
- Chain-of-custody: for every collected artifact, record URL, retrieval timestamp (UTC, ISO-8601), HTTP status, and tool version. Apply Source Verification Framework evaluation standards to each source before downstream analysis. This is the difference between an artifact that’s usable in an analytical product and one that’s discardable. Mirror to the vault Evidence pack via the osint-evidence-pack convention.
- Reproducibility: pin dependencies in `requirements.txt` with exact versions. The script that worked six months ago may not work today unless dependencies are pinned.
- Idempotency: design every collection step to be safely re-runnable. Use upserts, not inserts; check existence before fetching.
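The deduplication and chain-of-custody rules above lend themselves to a small record-keeping helper. The sketch below is one possible layout, not the osint-evidence-pack format: the field names, the `custody_record` and `store_if_new` helpers, and the in-memory `seen` set are assumptions for illustration. SHA-256 is used for the content hash.

```python
import hashlib
import json
from datetime import datetime, timezone


def custody_record(url: str, status: int, content: bytes, tool: str) -> dict:
    """Describe one collected artifact for the evidence trail."""
    return {
        "url": url,
        "retrieved_at": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "http_status": status,
        "sha256": hashlib.sha256(content).hexdigest(),
        "tool": tool,
    }


def store_if_new(record: dict, seen: set[str], log: list[str]) -> bool:
    """Keep first sightings; log repeats instead of discarding them silently."""
    duplicate = record["sha256"] in seen
    seen.add(record["sha256"])
    log.append(json.dumps({**record, "duplicate": duplicate}))
    return not duplicate
```

In a real run `seen` would be loaded from the investigation's database and `log` would be an append-only file; the return value tells the caller whether to store the content.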
Gap: Python OSINT automation is not covered by most legal frameworks’ explicit definitions of “collection”; courts and regulators have largely failed to keep pace with the techniques described in this guide. Legal exposure varies significantly by jurisdiction (CFAA in the US, the GDPR in the EU, the LGPD in Brazil, Computer Misuse Act in the UK). Scraping a public website may or may not be lawful depending on the site’s ToS, the jurisdiction, and the purpose; API access under a free tier may carry contractual restrictions that apply even where no statute does. See OSINT Legal Framework for the analytical/legal framework. When in doubt, consult counsel before scaling a collection operation. (Fact: in hiQ v. LinkedIn the Ninth Circuit held that scraping publicly accessible profiles likely does not violate the CFAA, but the contract and state-law claims survived, and the district court later found that hiQ had breached LinkedIn’s user agreement.)
Toolkits and Frameworks Built on Python
The Python OSINT ecosystem includes mature frameworks that abstract much of the work described above. These are worth installing and reading even if your final workflow uses custom scripts — they are reference implementations of well-engineered OSINT automation.
| Tool | Function | Notes |
|---|---|---|
| theHarvester | Domain/email/host enumeration | Passive recon; queries dozens of search engines and APIs; first stop for a new domain investigation |
| Recon-ng | Modular OSINT framework | Built on Python; Metasploit-style module system; integrates with major API providers |
| Maltego TRX | Custom transform development | Python SDK for building custom Maltego transforms — bridges Python pipelines into the Maltego graph |
| Spiderfoot | Automated OSINT footprint | Web UI + API; 200+ collection modules; the closest thing to “OSINT as a service” you can self-host |
| Datasploit | Multi-source correlation | Active + passive recon; less maintained than alternatives but useful for username/domain pivots |
| OSINT-SPY | Target profiling | Username/email/IP profiling; thinner than alternatives but easy to embed in pipelines |
| Sherlock | Username enumeration across 400+ sites | The standard for username sweeps; see osint-username |
| Holehe | Email enumeration across services | Checks if an email is registered on 100+ platforms without sending reset emails |
| Maigret | Sherlock-like username sweep | More sites, slower, better reporting |
Key Connections
- OSINT — the discipline
- OSINT Toolkit Essentials — the broader toolkit this guide automates
- Maltego Guide — Maltego TRX exposes custom Python transforms for the entity-link register
- Shodan-Censys Guide — the infrastructure-OSINT APIs most commonly automated from Python
- Crypto Tracing Tools Guide — Python is the dominant language for blockchain analysis automation
- Geolocation Methodology — Python automation supports bulk reverse-image and geo-pivoting
- Network Analysis Methodology — NetworkX is the canonical implementation of the centrality measures
- Link Analysis — graph-based relationship analysis complementing NetworkX centrality
- Entity Resolution Methodology — identity disambiguation pipeline fed by Python NLP extraction
- Source Verification Framework — source-quality evaluation applied to all automated collection
- DARKINT — dark web and .onion collection methodology
- AI-Powered OSINT Tools Guide — companion guide for AI-enhanced OSINT collection
- LLM-OSINT-SOP-A2IC — LLM-augmented OSINT SOP for AI-assisted analysis pipelines
- OSINT Ethics — ethical framework for automated collection
- OSINT Legal Framework — legal context for automated collection
- Social Media Intelligence — Python is the standard for SOCMINT collection automation
- Attribution — Python pipelines support the multi-source pivots underpinning attribution analysis
Sources
- Python Software Foundation — Python 3 Documentation (python.org/doc) — High (primary)
- Shodan — API Documentation (shodan.readthedocs.io) — High (primary)
- Scrapinghub — Scrapy Documentation (docs.scrapy.org) — High (primary)
- NetworkX Developers — NetworkX Documentation (networkx.org) — High (primary)
- Microsoft — Playwright for Python (playwright.dev/python) — High (primary)
- Telethon Developers — Telethon Documentation (docs.telethon.dev) — High (primary)
- Explosion AI — spaCy Documentation (spacy.io) — High (primary)
- Singh, Justin — Hands-On Python for Cybersecurity and OSINT (Packt, 2021) — Medium (practitioner text)
- Bazzell, Michael — Open Source Intelligence Techniques, 10th ed. — Medium (practitioner reference, includes Python sections)
- SANS Institute — SEC487: Open-Source Intelligence Gathering and Analysis — High (course materials)
- HaveIBeenPwned — API v3 Specification (haveibeenpwned.com/API/v3) — High (primary)
- hiQ Labs, Inc. v. LinkedIn Corp., 9th Cir. (2019, 2022) — High (case law for US scraping legality)