Katana, a new web crawler and spidering tool from ProjectDiscovery.io, has been released.
Install
katana requires Go 1.18 to install successfully.
go install github.com/projectdiscovery/katana/cmd/katana@latest
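Assuming the Go bin directory is on your PATH, a quick way to confirm the install is to print the tool's version (flag names per the upstream help output):
$ katana -version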
Usage
Input
$ katana -u https://web.com
$ katana -u https://web.com,https://otherweb.com
$ katana -list url_list.txt
$ echo https://www.web.com | katana
$ cat domains | httpx | katana
Crawling Mode
According to Katana’s documentation:
Standard Mode
Standard crawling modality uses the standard Go HTTP library under the hood to handle HTTP requests/responses. This modality is much faster as it doesn’t have the browser overhead. Still, it analyzes HTTP response bodies as-is, without any JavaScript or DOM rendering, potentially missing post-DOM-rendered endpoints or asynchronous endpoint calls that might happen in complex web applications depending, for example, on browser-specific events.
Headless Mode
Headless mode hooks internal headless calls to handle HTTP requests/responses directly within the browser context. This offers two advantages:
- The HTTP fingerprint (TLS and user agent) fully identifies the client as a legitimate browser
- Better coverage, since endpoints are discovered by analyzing the standard raw response, as in the previous modality, and also the browser-rendered one with JavaScript enabled.
Headless crawling is optional and can be enabled using the -headless option.
JavaScript crawling can be enabled using the -jc option.
$ katana -u https://web.com
$ katana -u https://web.com -headless
$ katana -list url_list.txt -jc
Scope Control
-fs : Pre-defined scope field (dn,rdn,fqdn) (default "rdn")
  - rdn : crawling scoped to root domain name and all subdomains (default)
  - fqdn : crawling scoped to given sub(domain)
  - dn : crawling scoped to domain name keyword
-cs : In-scope URL regex to be followed by the crawler
-cos : Out-of-scope URL regex to be excluded by the crawler
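For example (illustrative commands; only the -fs, -cs, and -cos flags come from the documentation above, the scope value and regex patterns are arbitrary):
$ katana -u https://web.com -fs fqdn
$ katana -u https://web.com -cs login
$ katana -u https://web.com -cos logout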
Filters
The -f option can be used to specify any of the available fields:
(url,path,fqdn,rdn,rurl,qurl,qpath,file,key,value,kv,dir,udir)
Here is a table with examples of each field and expected output when used:
| Field | Description | Example |
| --- | --- | --- |
| url | URL Endpoint | https://admin.projectdiscovery.io/admin/login?user=admin&password=admin |
| qurl | URL including query param | https://admin.projectdiscovery.io/admin/login.php?user=admin&password=admin |
| qpath | Path including query param | /login?user=admin&password=admin |
| path | URL Path | https://admin.projectdiscovery.io/admin/login |
| fqdn | Fully Qualified Domain name | admin.projectdiscovery.io |
| rdn | Root Domain name | projectdiscovery.io |
| rurl | Root URL | https://admin.projectdiscovery.io |
| file | Filename in URL | login.php |
| key | Parameter keys in URL | user,password |
| value | Parameter values in URL | admin,admin |
| kv | Keys=Values in URL | user=admin&password=admin |
| dir | URL Directory name | /admin/ |
| udir | URL with Directory | https://admin.projectdiscovery.io/admin/ |
$ katana -u https://web.com -f qurl -silent
-em : Displays only output containing the given extensions.
-ef : Removes all URLs containing the given extensions.
$ katana -u https://web.com -silent -em js,jsp,json
$ katana -u https://web.com -silent -ef css,txt,md
Rate-limits
-delay : Delay in seconds between each new request katana makes while crawling.
$ katana -u https://web.com -delay 20
-c : Number of URLs per target to fetch at the same time (concurrency).
$ katana -u https://web.com -c 20
-p : Number of targets to process at the same time from list input (parallelism).
$ katana -u https://web.com -p 20
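These options can also be combined; for instance, an illustrative command using only the flags shown above (the values are arbitrary):
$ katana -list url_list.txt -delay 1 -c 10 -p 5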
Output
$ katana -u https://web.com -o results
$ katana -u https://web.com -j -o results.json
$ katana -u https://web.com -silent
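Silent output prints one discovered URL per line, which makes it easy to chain katana with other tools. An illustrative pipeline, assuming httpx (already used in the input examples) is installed:
$ katana -u https://web.com -silent | httpx -silent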
Configuration
Depth
$ katana -u https://web.com -depth 10
Timeout
$ katana -u https://web.com -timeout 20
Proxy
$ katana -u https://web.com -proxy http://127.0.0.1:8080
Known files
$ katana -u https://web.com -kf all
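Instead of all, -kf can also be limited to specific sources, per the values listed in the tool's help output:
$ katana -u https://web.com -kf robotstxt,sitemapxml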