# Spider

On this page, you'll learn how to:

* Use the `tourmaline spider` command
* Manipulate the spider to accomplish specific tasks

## Basics

A web spider is a tool that starts with a single path and scans that page for more paths. It then scans each newly found page for paths, and so on until no unscanned paths remain.
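The loop described above can be sketched in a few lines of Python. This is only an illustration of the general idea, not Tourmaline's actual implementation: the `SITE` link graph and the `crawl` function are hypothetical, and the breadth-first visiting order is an assumption.

```python
from collections import deque

# Hypothetical in-memory "site": each path maps to the paths it links to.
SITE = {
    "/": ["/about", "/blog"],
    "/about": ["/", "/team"],
    "/blog": ["/blog/post-1", "/about"],
    "/team": [],
    "/blog/post-1": ["/"],
}

def crawl(start, links=SITE):
    """Return every path reachable from `start`, in the order visited."""
    queue = deque([start])  # paths waiting to be scanned
    seen = {start}          # paths already queued, so none is scanned twice
    visited = []
    while queue:            # stop once all paths have been exhausted
        path = queue.popleft()
        visited.append(path)
        for found in links.get(path, []):
            if found not in seen:
                seen.add(found)
                queue.append(found)
    return visited

print(crawl("/"))
# ['/', '/about', '/blog', '/team', '/blog/post-1']
```

The `seen` set is what guarantees termination: pages that link back to each other are only queued once.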

The `tourmaline spider` command takes the target URL as its only required (positional) argument.

```bash
tourmaline spider <URL>
```

The command also takes various optional arguments:

* `-t|--threads`: The number of parallel threads to use (defaults to 12).
* `-o|--outfile`: Path to an output dump file.
* `-d|--depth <DEPTH>`: Specifies the maximum depth for the spider to reach (defaults to -1, meaning no limit).
* `-r|--regex <REGEX>`: A regex that paths must match to be added to the output (paths that do not match are still added to the queue).
* `-i|--ignore <IGNORE_REGEX>`: A regex that paths must **not** match to be added to the output (paths that match are still added to the queue).
* `--force-regex`: Specifies that any paths not fitting the `-r <REGEX>` will not be added to the queue.
* `--force-ignore`: Specifies that any paths fitting the `-i <IGNORE_REGEX>` will not be added to the queue.
* `-k|--known <KNOWN>`: A comma-separated list of known paths for the spider to start with.
* `--known-file <KNOWN_FILE>`: The path to a file containing known paths.
* `-l|--limit <LIMIT>`: The maximum number of paths for the spider to return.
* `--force-limit`: Specifies that only `-l <LIMIT>` paths should be scanned.
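To make the difference between the output filters (`-r`, `-i`) and their `--force-*` counterparts concrete, here is a small Python model built only from the option descriptions above. The `classify` function and its parameter names are hypothetical, and the model assumes that a pattern may match anywhere in the path (Python's `re.search`); Tourmaline's real matching rules may differ.

```python
import re

def classify(path, regex=None, ignore=None, force_regex=False, force_ignore=False):
    """Return (in_output, in_queue) for a newly discovered path."""
    # A path "fits" when no -r pattern is given or the pattern matches.
    fits = regex is None or re.search(regex, path) is not None
    # A path is "ignored" only when an -i pattern is given and it matches.
    ignored = ignore is not None and re.search(ignore, path) is not None

    in_output = fits and not ignored
    # Without the force flags every path is queued; each flag removes
    # the paths that its filter rejects.
    in_queue = not (force_regex and not fits) and not (force_ignore and ignored)
    return in_output, in_queue

# -r "/en/" alone: hidden from the output but still scanned.
print(classify("https://example.com/de/tos", regex="/en/"))
# (False, True)

# -r "/en/" --force-regex: neither reported nor scanned.
print(classify("https://example.com/de/tos", regex="/en/", force_regex=True))
# (False, False)
```

In this model the plain filters only affect what is reported, while the force flags also shrink the amount of work the spider does.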

## Examples

### Regexes Example

Let's say we're enumerating `example.com`. Initially, you might run:

```bash
tourmaline spider example.com
```

Now imagine that this is the output you started to get:

```
https://example.com/en/ - 200 OK (20 left)
https://example.com/de/ - 200 OK (30 left)
https://example.com/ko/ - 200 OK (45 left)
...
https://example.com/en/tos - 200 OK (490 left)
https://example.com/de/tos - 200 OK (540 left)
https://example.com/ko/tos - 200 OK (780 left)
```

This means that the site has a separate set of pages for every language it supports. This is great for people trying to read what's on the site, but it's a little annoying for us. We can filter out non-English results with:

```bash
tourmaline spider example.com -r "/en/"
```

{% hint style="info" %}
You can change "/en/" to the code of any language, not just English.
{% endhint %}

The output would then look like this:

```
https://example.com/en/ - 200 OK (20 left)
https://example.com/en/tos - 200 OK (490 left)
```

However, the spider still spends resources scanning the non-English pages, which increases the search time. We can avoid this with:

```bash
tourmaline spider example.com -r "/en/" --force-regex
```

This makes sure that only English paths are added to the queue.
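The saving can be illustrated with a toy simulation. Everything here is hypothetical: the `SITE` link graph stands in for `example.com`, the `spider` function is a simplified model rather than Tourmaline's code, and patterns are assumed to match anywhere in the path.

```python
import re
from collections import deque

# Hypothetical link graph: three language sections that link to each other.
SITE = {
    "/en/": ["/en/tos", "/de/", "/ko/"],
    "/en/tos": ["/en/"],
    "/de/": ["/de/tos", "/en/"],
    "/de/tos": ["/de/"],
    "/ko/": ["/ko/tos", "/en/"],
    "/ko/tos": ["/ko/"],
}

def spider(start, regex, force_regex=False):
    """Return (output, scanned): paths reported and paths actually fetched."""
    queue, seen = deque([start]), {start}
    output, scanned = [], []
    while queue:
        path = queue.popleft()
        scanned.append(path)
        if re.search(regex, path):
            output.append(path)
        for found in SITE.get(path, []):
            if found in seen:
                continue
            seen.add(found)
            # With force_regex, non-matching paths never enter the queue.
            if force_regex and not re.search(regex, found):
                continue
            queue.append(found)
    return output, scanned

out, scanned = spider("/en/", "/en/")
print(out, len(scanned))
# ['/en/', '/en/tos'] 6

out, scanned = spider("/en/", "/en/", force_regex=True)
print(out, len(scanned))
# ['/en/', '/en/tos'] 2
```

Both runs report the same two paths, but the forced run fetches two pages instead of six. One trade-off worth keeping in mind: a matching page that is only linked from non-matching pages would never be discovered in the forced run.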

### Ignore Regexes Example

Here, we'll be enumerating the made-up site `hackme.com` with:

```bash
tourmaline spider hackme.com
```

We just want to scout out the pages on the site and see if we get anything interesting. However, upon running the command we get:

```
https://hackme.com/ - 200 OK (190 left)
https://hackme.com/images/hacker1.png - 200 OK (189 left)
https://hackme.com/images/hacker2.png - 200 OK (188 left)
https://hackme.com/images/hacker3.png - 200 OK (187 left)
https://hackme.com/images/cool-computer.png - 200 OK (186 left)
...
```

Not only is this annoying, but it also makes the search much longer than it needs to be. We can fix this as we did in the [previous example](/tourmaline/commands/spider.md#regexes-example), this time using an ignore regex:

```bash
tourmaline spider hackme.com -i ".*\.(jpg|png|webp)" --force-ignore
```

This ensures that images with these extensions won't be added to the queue.
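Before starting a long scan, it can help to check a pattern against a few sample paths. The sketch below does this with Python's `re` module, assuming the pattern may match anywhere in the path; Tourmaline's regex engine may behave differently. All sample paths other than `hacker1.png` are made up for illustration.

```python
import re

IGNORE = r".*\.(jpg|png|webp)"  # the pattern from the command above

def is_ignored(path, pattern=IGNORE):
    """True if `pattern` matches anywhere in `path`."""
    return re.search(pattern, path) is not None

samples = [
    "https://hackme.com/",                          # kept
    "https://hackme.com/images/hacker1.png",        # ignored
    "https://hackme.com/images/banner.jpeg",        # kept: ".jpeg" is not listed
    "https://hackme.com/images/logo.gif",           # kept: ".gif" is not listed
    "https://hackme.com/downloads/photo.png.html",  # ignored: ".png" appears mid-path
]

for path in samples:
    print("ignored" if is_ignored(path) else "kept   ", path)
```

Under these assumptions, adding the missing extensions and anchoring the pattern at the end of the path, for example `\.(jpe?g|png|gif|webp)$`, would catch `.jpeg` and `.gif` files while leaving `photo.png.html` in the queue.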


