Discussion:
Host timeouts on large SYN scans
p***@cea.fr
2012-09-21 11:03:59 UTC
Permalink
Hi list,

I run scans on a LAN / fast WAN with "-iL -" and an external script
feeds the targets in a random order when select() says nmap's stdin is
ready.

"--host-timeout" is set to "15m" or "60m"

After running a Ping scan against 4096 targets, nmap runs a succession
of (SYN, Service, Traceroute, NSE) scans against "hostgroups".

The first hostgroup has 4 hosts, and all the next ones have 64 (or
whatever specified with --max-hostgroup). At first everything works pretty
fine, but after a few hostgroups (5 to 10), all the "SYN Stealth Scan"
task end with "XX hosts timed out" (with XX the size of the hostgroup).

This is 100% reproducible, and happens with the SVN version. I didn't
try with the stable version yet.

I've then tried some tweaks like "--nogcc" or specifying a
"--min-parallelism" value (e.g. 30) with no different result.

I have no rate-limiting device between the targets and the scan machine.

Do you have an idea of what I could do to try to debug (or fix !) this
issue?

Regards,

Pierre
David Fifield
2012-09-21 15:36:29 UTC
Permalink
Post by p***@cea.fr
I run scans on a LAN / fast WAN with "-iL -" and an external script
feeds the targets in a random order when select() says nmap's stdin
is ready.
"--host-timeout" is set to "15m" or "60m"
After running a Ping scan against 4096 targets, nmap runs a
succession of (SYN, Service, Traceroute, NSE) scans against
"hostgroups".
The first hostgroup has 4 hosts, and all the next ones have 64 (or
whatever specified with --max-hostgroup). At first everything works pretty
fine, but after a few hostgroups (5 to 10), all the "SYN Stealth
Scan" task end with "XX hosts timed out" (with XX the size of the
hostgroup).
When does this happen? Is it 15m after the start of the entire scan, or
15m after the start of the hostgroup? Or something else?

Sometimes SYN scans can go slowly enough that they reach a host timeout.
You can try the option --defeat-rst-ratelimit as RST rate limiting is
the msot likely thing to severely slow down a SYN scan. It's strange
that it happens to the whole group at once, though. It might be a bug
with stopping and restarting the timeout timers.

Can you show us the rest of the command line you are using?

Does this same thing happen if you write 4096 IP addresses in random
order to a file, and then read -iL from that file?

David Fifield
Henri Doreau
2012-09-22 19:25:43 UTC
Permalink
Hello,

I work with Pierre and we've been troubleshooting (in vain, so far)
this issue together. Some answers inline below.
Post by David Fifield
When does this happen? Is it 15m after the start of the entire scan, or
15m after the start of the hostgroup? Or something else?
This happens 15m after the start of the hostgroup.
Post by David Fifield
Sometimes SYN scans can go slowly enough that they reach a host timeout.
You can try the option --defeat-rst-ratelimit as RST rate limiting is
the msot likely thing to severely slow down a SYN scan. It's strange
that it happens to the whole group at once, though. It might be a bug
with stopping and restarting the timeout timers.
We didn't try --defeat-rst-ratelimit, but also we know that there's no
device rate-limiting. Targets are supplied in a random order, first
hostgroups always succeed, next ones always fail. If you stop and
restart immediately you get the same behavior again. Because of this
we kind of excluded a network-related issue.
Post by David Fifield
Can you show us the rest of the command line you are using?
I let Pierre answer this, but this we could also observe without any
timing/performance related parameter.
Post by David Fifield
Does this same thing happen if you write 4096 IP addresses in random
order to a file, and then read -iL from that file?
This is one of the next things we wanted to try, not done yet.

Running a single nmap instance per target (XX being launched in
parallel, with XX the hostgroup size we use now) works fine.

Regards
--
Henri
David Fifield
2012-09-22 19:32:45 UTC
Permalink
Post by Henri Doreau
Post by David Fifield
When does this happen? Is it 15m after the start of the entire scan, or
15m after the start of the hostgroup? Or something else?
This happens 15m after the start of the hostgroup.
So the problem is the SYN scan for some reason going slowly, not a bug
in timeout calculation?

David Fifield
p***@cea.fr
2012-09-24 13:54:07 UTC
Permalink
Hello

I've just noticed that this behaviour also affects the next ping scan
and continues after it :

$ grep -E '<task(begin|end) task="(SYN|Ping)' log.xml
<taskbegin task="Ping Scan" time="1348472434"/>
<taskend task="Ping Scan" time="1348472973" extrainfo="4096 total hosts"/>
<taskbegin task="SYN Stealth Scan" time="1348472973"/>
<taskend task="SYN Stealth Scan" time="1348472975" extrainfo="4000 total
ports"/>
<taskbegin task="SYN Stealth Scan" time="1348473114"/>
<taskend task="SYN Stealth Scan" time="1348473539" extrainfo="1 host
timed out"/>
<taskbegin task="SYN Stealth Scan" time="1348475230"/>
<taskend task="SYN Stealth Scan" time="1348475300" extrainfo="64000
total ports"/>
<taskbegin task="SYN Stealth Scan" time="1348476424"/>
<taskend task="SYN Stealth Scan" time="1348476451" extrainfo="64000
total ports"/>
<taskbegin task="SYN Stealth Scan" time="1348477681"/>
<taskend task="SYN Stealth Scan" time="1348477703" extrainfo="64000
total ports"/>
<taskbegin task="SYN Stealth Scan" time="1348478885"/>
<taskend task="SYN Stealth Scan" time="1348478963" extrainfo="64000
total ports"/>
<taskbegin task="Ping Scan" time="1348480060"/>
<taskend task="Ping Scan" time="1348480930" extrainfo="4096 total hosts"/>
<taskbegin task="SYN Stealth Scan" time="1348480930"/>
<taskend task="SYN Stealth Scan" time="1348480953" extrainfo="64000
total ports"/>
<taskbegin task="SYN Stealth Scan" time="1348482185"/>
<taskend task="SYN Stealth Scan" time="1348482205" extrainfo="64000
total ports"/>
<taskbegin task="SYN Stealth Scan" time="1348483324"/>
<taskend task="SYN Stealth Scan" time="1348484218" extrainfo="1 host
timed out"/>
<taskbegin task="SYN Stealth Scan" time="1348485433"/>
<taskend task="SYN Stealth Scan" time="1348486366" extrainfo="64 hosts
timed out"/>
<taskbegin task="SYN Stealth Scan" time="1348486622"/>
<taskend task="SYN Stealth Scan" time="1348487498" extrainfo="64 hosts
timed out"/>
<taskbegin task="Ping Scan" time="1348489421"/>
<taskend task="Ping Scan" time="1348490397" extrainfo="3347 hosts timed
out"/>
<taskbegin task="SYN Stealth Scan" time="1348490397"/>
<taskend task="SYN Stealth Scan" time="1348491398" extrainfo="64 hosts
timed out"/>
<taskbegin task="SYN Stealth Scan" time="1348492921"/>
<taskend task="SYN Stealth Scan" time="1348493798" extrainfo="64 hosts
timed out"/>

Pierre
p***@cea.fr
2012-09-24 12:31:54 UTC
Permalink
Hello,
Post by Henri Doreau
Post by David Fifield
When does this happen? Is it 15m after the start of the entire scan, or
15m after the start of the hostgroup? Or something else?
This happens 15m after the start of the hostgroup.
Exactly, or whatever the --host-timeout value is.
Post by Henri Doreau
Post by David Fifield
Can you show us the rest of the command line you are using?
I let Pierre answer this, but this we could also observe without any
timing/performance related parameter.
Sure, we use :
-vv -n -oX log.xml -iL - -PS -PA -PU -PE -PP -PO -sS -A --host-timeout 15m

We've also tried --host-timeout 60m with the same results.
Post by Henri Doreau
Post by David Fifield
Does this same thing happen if you write 4096 IP addresses in random
order to a file, and then read -iL from that file?
This is one of the next things we wanted to try, not done yet.
This does exactly the same. So replacing "-iL -" with "-iL alltargets"
(~20k lines) gives the same result :

$ grep -E '<task(begin|end) task="SYN' log.xml
<taskbegin task="SYN Stealth Scan" time="1348472973"/>
<taskend task="SYN Stealth Scan" time="1348472975" extrainfo="4000 total
ports"/>
<taskbegin task="SYN Stealth Scan" time="1348473114"/>
<taskend task="SYN Stealth Scan" time="1348473539" extrainfo="1 host
timed out"/>
<taskbegin task="SYN Stealth Scan" time="1348475230"/>
<taskend task="SYN Stealth Scan" time="1348475300" extrainfo="64000
total ports"/>
<taskbegin task="SYN Stealth Scan" time="1348476424"/>
<taskend task="SYN Stealth Scan" time="1348476451" extrainfo="64000
total ports"/>
<taskbegin task="SYN Stealth Scan" time="1348477681"/>
<taskend task="SYN Stealth Scan" time="1348477703" extrainfo="64000
total ports"/>
<taskbegin task="SYN Stealth Scan" time="1348478885"/>
<taskend task="SYN Stealth Scan" time="1348478963" extrainfo="64000
total ports"/>
<taskbegin task="SYN Stealth Scan" time="1348480930"/>
<taskend task="SYN Stealth Scan" time="1348480953" extrainfo="64000
total ports"/>
<taskbegin task="SYN Stealth Scan" time="1348482185"/>
<taskend task="SYN Stealth Scan" time="1348482205" extrainfo="64000
total ports"/>
<taskbegin task="SYN Stealth Scan" time="1348483324"/>
<taskend task="SYN Stealth Scan" time="1348484218" extrainfo="1 host
timed out"/>
<taskbegin task="SYN Stealth Scan" time="1348485433"/>
<taskend task="SYN Stealth Scan" time="1348486366" extrainfo="64 hosts
timed out"/>
<taskbegin task="SYN Stealth Scan" time="1348486622"/>
<taskend task="SYN Stealth Scan" time="1348487498" extrainfo="64 hosts
timed out"/>

I don't always get these "1 host timed out" messages, and I think they
are OK (i.e., some hosts really need more than 15m to get scanned).

Regards,

Pierre
David Fifield
2012-10-03 15:31:35 UTC
Permalink
Post by Henri Doreau
Hello,
Post by Henri Doreau
Post by David Fifield
When does this happen? Is it 15m after the start of the entire scan, or
15m after the start of the hostgroup? Or something else?
This happens 15m after the start of the hostgroup.
Exactly, or whatever the --host-timeout value is.
Post by Henri Doreau
Post by David Fifield
Can you show us the rest of the command line you are using?
I let Pierre answer this, but this we could also observe without any
timing/performance related parameter.
-vv -n -oX log.xml -iL - -PS -PA -PU -PE -PP -PO -sS -A --host-timeout 15m
We've also tried --host-timeout 60m with the same results.
Pierre kindly bisected this problem, and encoutnered some difficulties,
but the evidence indicates that this is a new problem, and that the
revision that introduced it is somewhere between r28648 and r29507.

The only substantive change to scan_engine.cc in that range is r29051.
Grepping for "timeout" finds r29215, which affects timeouts for NSE
worker threads, however Pierre said that the problem is not with
inaccurate timeout calculations, but with SYN scans becoming slow.

David Fifield

Loading...