Spider

A tutorial for using the Tourmaline spider command.

On this page, you'll learn how to:

  • Use the tourmaline spider command

  • Manipulate the spider to accomplish specific tasks

Basics

A web spider is a tool that starts with a single path and scans it for more paths. It then scans the found pages for paths again, and so on until all paths have been exhausted.
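The loop described above can be sketched in a few lines of Python. This is only a conceptual illustration, not Tourmaline's actual implementation: the LINKS table is a hypothetical in-memory stand-in for a real site, and no network requests are made.

```python
from collections import deque

# Hypothetical in-memory "site": each path maps to the paths it links to.
LINKS = {
    "/": ["/about", "/blog"],
    "/about": ["/"],
    "/blog": ["/blog/post-1", "/about"],
    "/blog/post-1": [],
}

def crawl(start, links):
    """Return every path reachable from start, in the order it was scanned."""
    queue = deque([start])  # paths waiting to be scanned
    seen = {start}          # paths already queued, so nothing is scanned twice
    scanned = []
    while queue:
        path = queue.popleft()
        scanned.append(path)
        for found in links.get(path, []):
            if found not in seen:
                seen.add(found)
                queue.append(found)
    return scanned

print(crawl("/", LINKS))
```

The queue is what the rest of this page refers to when it says a path is "added to the queue": a path that never enters the queue is never scanned.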

The tourmaline spider command takes the target URL as its only required positional argument.

tourmaline spider <URL>

The command also takes various optional arguments:

  • -t|--threads: The number of parallel threads to use (defaults to 12).

  • -o|--outfile: Path to an output dump file.

  • -d|--depth <DEPTH>: Specifies the maximum depth for the spider to reach (defaults to -1, meaning no limit).

  • -r|--regex <REGEX>: A regex that paths must match to be added to the output (paths that do not match are still added to the queue).

  • -i|--ignore <IGNORE_REGEX>: A regex that paths must not match to be added to the output (paths that do match are still added to the queue).

  • --force-regex: Specifies that any paths not fitting the -r <REGEX> will not be added to the queue.

  • --force-ignore: Specifies that any paths fitting the -i <IGNORE_REGEX> will not be added to the queue.

  • -k|--known <KNOWN>: A comma-separated list of known paths for the spider to start with.

  • --known-file <KNOWN_FILE>: The path to a file containing known paths.

  • -l|--limit <LIMIT>: The maximum number of paths for the spider to return.

  • --force-limit: Specifies that only <LIMIT> paths (the value given to -l) should be scanned.
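To make the depth option more concrete, here is a rough Python sketch of a depth-bounded crawl. It is an illustration under stated assumptions rather than a description of Tourmaline's internals: the LINKS table is hypothetical, the starting path is assumed to sit at depth 0, and -1 is treated as "no limit" to mirror the documented default.

```python
from collections import deque

# Hypothetical chain of pages, each one level deeper than the last.
LINKS = {"/": ["/a"], "/a": ["/a/b"], "/a/b": ["/a/b/c"], "/a/b/c": []}

def crawl(start, links, max_depth=-1):
    """Scan paths breadth-first, following links at most max_depth levels deep."""
    queue = deque([(start, 0)])  # (path, depth); assumption: start is depth 0
    seen = {start}
    scanned = []
    while queue:
        path, depth = queue.popleft()
        scanned.append(path)
        if max_depth != -1 and depth >= max_depth:
            continue  # depth bound reached: do not follow this page's links
        for found in links.get(path, []):
            if found not in seen:
                seen.add(found)
                queue.append((found, depth + 1))
    return scanned

print(crawl("/", LINKS))               # no limit
print(crawl("/", LINKS, max_depth=1))  # start page plus one level of links
```

How Tourmaline itself counts depth (for example, whether the start URL is level 0 or 1) is not stated on this page, so it is worth checking against a small site before relying on a specific value.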

Examples

Regexes Example

Let's say we're enumerating example.com. Initially, you might run:

tourmaline spider example.com

Now imagine that this is the output you started to get:

https://example.com/en/ - 200 OK (20 left)
https://example.com/de/ - 200 OK (30 left)
https://example.com/ko/ - 200 OK (45 left)
...
https://example.com/en/tos - 200 OK (490 left)
https://example.com/de/tos - 200 OK (540 left)
https://example.com/ko/tos - 200 OK (780 left)

This means that the site has different pages for every language it supports. This is great for people trying to read what's on the site, but it's a little annoying for us. We can filter out non-English results with:

tourmaline spider example.com -r "/en/"

You can change "/en/" to the code of any other language the site uses, not just English.

Our output would then look like this:

https://example.com/en/ - 200 OK (20 left)
https://example.com/en/tos - 200 OK (490 left)

The spider still spends resources scanning the pages for the other languages, which lengthens the search. We can avoid this with:

tourmaline spider example.com -r "/en/" --force-regex

This ensures that only English paths are added to the queue.
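The difference between filtering the output and filtering the queue can be seen in a small Python simulation. This is a sketch, not Tourmaline's code: the LINKS table is a hypothetical miniature of the example site, the force parameter stands in for --force-regex, and the pattern is assumed to match anywhere in the path (which is what the "/en/" example implies).

```python
import re
from collections import deque

# Hypothetical miniature of the multilingual site from the example.
LINKS = {
    "/": ["/en/", "/de/"],
    "/en/": ["/en/tos"],
    "/de/": ["/de/tos"],
    "/en/tos": [],
    "/de/tos": [],
}

def crawl(start, links, pattern, force=False):
    """Return (paths reported in the output, number of pages scanned)."""
    regex = re.compile(pattern)
    queue = deque([start])
    seen = {start}
    output, scanned = [], 0
    while queue:
        path = queue.popleft()
        scanned += 1
        if regex.search(path):  # output filter: applies in both modes
            output.append(path)
        for found in links.get(path, []):
            if found in seen:
                continue
            if force and not regex.search(found):
                continue  # queue filter: non-matching paths are never scanned
            seen.add(found)
            queue.append(found)
    return output, scanned

print(crawl("/", LINKS, "/en/"))              # same output, all 5 pages scanned
print(crawl("/", LINKS, "/en/", force=True))  # same output, only 3 pages scanned
```

One consequence of filtering the queue in this model is that a matching page linked only from non-matching pages is never discovered, so the faster run can return fewer results on some sites.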

Ignore Regexes Example

Here, we'll be enumerating the made-up site hackme.com with:

tourmaline spider hackme.com

We just want to scout out the pages on the site and see if we get anything interesting. However, upon running the command we get:

https://hackme.com/ - 200 OK (190 left)
https://hackme.com/images/hacker1.png - 200 OK (189 left)
https://hackme.com/images/hacker2.png - 200 OK (188 left)
https://hackme.com/images/hacker3.png - 200 OK (187 left)
https://hackme.com/images/cool-computer.png - 200 OK (186 left)
...

Not only is this annoying, but it makes the search much longer than it really needs to be. We can fix this like we did in the previous example, just this time using an ignore regex:

tourmaline spider hackme.com -i ".*\.(jpg|png|webp)" --force-ignore

This ensures that images won't be added to the queue.
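Before starting a long crawl, it can help to try the ignore pattern on a handful of sample paths. The Python check below is a sketch under assumptions: the sample URLs other than the ones in the output above are made up, and the pattern is assumed to match anywhere in the path, since this page does not say whether Tourmaline anchors its regexes.

```python
import re

# The ignore pattern from the command above, and a stricter variant
# that only matches when the extension is at the very end of the path.
IGNORE = re.compile(r".*\.(jpg|png|webp)")
IGNORE_ANCHORED = re.compile(r"\.(jpg|png|webp)$")

# Sample paths; everything except the first two is hypothetical.
paths = [
    "https://hackme.com/",
    "https://hackme.com/images/hacker1.png",
    "https://hackme.com/login",
    "https://hackme.com/images/banner.webp",
    "https://hackme.com/photos.png.html",
]

kept = [p for p in paths if not IGNORE.search(p)]
kept_anchored = [p for p in paths if not IGNORE_ANCHORED.search(p)]

print(kept)           # photos.png.html is dropped: ".png" appears mid-path
print(kept_anchored)  # photos.png.html is kept: it does not end in an image extension
```

If pages such as photos.png.html matter for your scan, ending the pattern with "$" is one way to avoid ignoring them, provided the tool's regex engine supports that anchor.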
