By using tcpdump to troubleshoot an elusive error, we uncovered a man-in-the-middle (MITM) ssh proxy installed by our information security (InfoSec) team to protect a set of machines which were accessible from the internet. The ssh proxy in question was Palo Alto Networks' (PAN) Layer 7 proxy (i.e. it worked on any port, not solely ssh's port 22), and was discovered when we observed a failure to negotiate ciphers during the ssh key exchange.
In our team's Concourse CI pipelines, we create new PCF (Pivotal Cloud Foundry) environments, subject them to a rigorous battery of tests, and then destroy them. Among our tests is a test suite known as NATS or CNATS (not to be confused with the NATS messaging bus), which runs cf ssh commands to test app-to-app connectivity.
The error was elusive, but inconvenient — it would cause an entire test suite to fail. Our only clue was a cryptic ssh failure:
Error opening SSH connection: ssh: handshake failed: EOF
Let's be clear: we're not using OpenSSH in our tests. Sure, we're using the SSH protocol as implemented by the Golang library, but we're not using the command line tool which so many of us know and love. In other words, we type cf ssh rather than ssh.
The purpose of this specialized implementation of the SSH protocol is to allow users of our Pivotal Application Service (PAS) software to connect to their application, typically to debug.
Once again, though, it's not quite OpenSSH. For one thing, our server side binds to port 2222, not sshd's 22. Also, it's written in Golang, not C (both the client and the server).
The problem wasn’t consistent. In fact, over the course of a 20-minute test run, it would only appear once.
It didn’t appear everywhere—one of our environments, maintained in San Francisco, seemed immune to the problem. In fact, the problem reared its ugly head only in our San Jose environments.
And, strangest of all, the problem only occurred on the first connection attempt. The first time cf ssh was run, it would fail, but subsequent attempts would succeed.
We attempted connecting from workstations in Palo Alto, San Francisco, and Santa Monica. The behavior remained consistent: the first attempt would fail, and the remaining would succeed.
We tried using ssh as a client instead of cf ssh. Same behavior: the first attempt would fail, the remainder would succeed.
We tried bringing up sshd as a server. The results surprised us: no failures. Not one. Our ssh-proxy failed, but sshd didn't. What was going on?
We knew it was time for tcpdump. If we were going to get any further, we needed to examine the raw packets.
tcpdump on our Server
We ran tcpdump on our server (the "Diego Brain") to determine what was happening during failed cf ssh connections. We discovered that, from the Diego Brain's perspective, the user was shutting down the connection (by sending a FIN packet).
We dug deeper — was there anything happening in the key exchange that caused the connection to shut down?
Yes, there was something happening: the client and the Diego Brain could not agree on a common set of ciphers.
These were the ciphers offered by the Diego Brain. Note that these ciphers are the ones included in Golang's SSH library.
These were the ciphers offered by the client:
We believe that the client shut down the connection because it could not agree on a common cipher for key exchange. But the client and server were both written in Golang, so their cipher suites should be identical. In fact, both Diffie-Hellman group exchange ciphers are explicitly considered to be legacy protocols by the Golang maintainers. Why was the client's cipher suite different, and why did it include legacy protocols?
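To make the failure mode concrete, here is a minimal sketch of how SSH picks an algorithm: roughly, the first entry in the client's list that the server also supports (a simplification of the rule in RFC 4253, section 7.1). The function name negotiate and the algorithm lists in main are illustrative assumptions; they are not the lists from our packet captures.

```go
package main

import "fmt"

// negotiate returns the first algorithm in the client's preference list
// that the server also offers. This is a simplified version of the SSH
// selection rule; real key exchange negotiation has extra conditions.
func negotiate(client, server []string) (string, bool) {
	offered := make(map[string]bool, len(server))
	for _, alg := range server {
		offered[alg] = true
	}
	for _, alg := range client {
		if offered[alg] {
			return alg, true
		}
	}
	return "", false
}

func main() {
	// Illustrative lists only (assumptions), not the captured ones.
	server := []string{"curve25519-sha256@libssh.org", "ecdh-sha2-nistp256"}
	goClient := []string{"curve25519-sha256@libssh.org", "ecdh-sha2-nistp256"}
	legacyClient := []string{
		"diffie-hellman-group-exchange-sha256",
		"diffie-hellman-group-exchange-sha1",
	}

	if alg, ok := negotiate(goClient, server); ok {
		fmt.Println("Go client and server agree on", alg)
	}
	if _, ok := negotiate(legacyClient, server); !ok {
		fmt.Println("no common algorithm, so the handshake cannot continue")
	}
}
```

When the two lists share nothing, there is no algorithm to fall back on, which is consistent with the peer simply closing the connection.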
At this point we also noticed that the SSH protocol version string was unexpected: it was SSH-2.0-PaloAltoNetworks_0.2. We decided to trace the packets from the client.
tcpdump on our Client
We ran tcpdump on our client, and attempted to connect (via ssh, not our custom client cf ssh) to our Diego Brain. We found the unexpected SSH protocol version string SSH-2.0-PaloAltoNetworks_0.2, but this time it was our Diego Brain presenting it:
But the SSH protocol version string SSH-2.0-PaloAltoNetworks_0.2 was only presented when the connection subsequently failed. In the diagram above, we can see that the ostensible Diego Brain shut down the connection by sending a FIN packet (packet 19) to our client.
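The version string is the first line each side sends, so it can be checked without a packet capture. The sketch below parses a line of the form SSH-protoversion-softwareversion (RFC 4253, section 4.2) and flags the Palo Alto Networks software name. The helper names parseIdent, looksProxied, and readIdent are hypothetical, and the sketch assumes the identification string is the very first line the peer sends.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// parseIdent splits "SSH-protoversion-softwareversion [comments]".
func parseIdent(line string) (proto, software string, ok bool) {
	line = strings.TrimRight(line, "\r\n")
	parts := strings.SplitN(line, "-", 3)
	if len(parts) != 3 || parts[0] != "SSH" {
		return "", "", false
	}
	// Optional comments follow the software version after a space.
	software = strings.SplitN(parts[2], " ", 2)[0]
	return parts[1], software, true
}

// looksProxied reports whether the software name matches the one we saw
// on intercepted connections.
func looksProxied(software string) bool {
	return strings.HasPrefix(software, "PaloAltoNetworks")
}

// readIdent reads the first line the peer sends.
func readIdent(r io.Reader) (string, error) {
	line, err := bufio.NewReader(r).ReadString('\n')
	if line == "" && err != nil {
		return "", err
	}
	return line, nil
}

func main() {
	// Stand-in for a real connection; a net.Conn would also work here
	// because it satisfies io.Reader.
	peer := strings.NewReader("SSH-2.0-PaloAltoNetworks_0.2\r\n")

	line, err := readIdent(peer)
	if err != nil {
		fmt.Println("read failed:", err)
		return
	}
	proto, software, ok := parseIdent(line)
	if !ok {
		fmt.Println("not an SSH identification string")
		return
	}
	fmt.Printf("protocol %s, software %s, proxied: %v\n",
		proto, software, looksProxied(software))
}
```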
We contacted IOPS, the Pivotal organization which maintains the network, who explained that the firewall is configured to intercept and proxy all ssh connections originating from or terminating at the San Jose datacenter in order to prevent ssh tunnel attacks, since the San Jose environments are accessible from the internet.
Our networking model was wrong:
We concluded that our cf ssh connection actually works this way:
Our final resolution to this issue was a workaround: for each test suite that runs cf ssh, we "prime the pump" by running a cf ssh command, which we expect to fail, before running the test suite.
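The workaround amounts to one throwaway attempt whose result is ignored, followed by the real run. A minimal sketch, assuming a hypothetical helper primeThenRun and a hypothetical app name my-app (our actual pipeline code is not shown here):

```go
package main

import (
	"log"
	"os/exec"
)

// primeThenRun makes one throwaway attempt, ignores its outcome, and then
// runs the real suite, returning only the suite's error.
func primeThenRun(prime, suite func() error) error {
	if err := prime(); err != nil {
		// Expected: the first connection is the one the proxy breaks.
		log.Printf("priming attempt failed (expected): %v", err)
	}
	return suite()
}

func main() {
	// "my-app" is a hypothetical app name; "true" is a no-op remote command.
	prime := func() error {
		return exec.Command("cf", "ssh", "my-app", "-c", "true").Run()
	}
	// Placeholder for invoking the real test suite.
	suite := func() error { return nil }

	if err := primeThenRun(prime, suite); err != nil {
		log.Fatal(err)
	}
}
```

Ignoring the priming error is deliberate: the failed attempt is what gets the intercepted connection out of the way before the tests start.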
[timeout] The exact timeout is a little more than an hour, somewhere between 3720 and 3840 seconds.
We wrote a script to more precisely determine the timeout. As can be seen from the output below (edited for clarity), there was no proxy attempt at 3720 seconds (Permission denied...), but there was at 3840 seconds (Connection closed...).
Permission denied (password).
Timeout: 3720
Connection closed by 10.195.84.17 port 2222
Timeout: 3840
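The probing approach can be sketched as a loop that idles for progressively longer periods and checks whether the next connection attempt is intercepted. This is not the script we ran; the function findIdleThreshold, its parameters, and the simulated 3800-second threshold in main are all assumptions for illustration. It also assumes that every connection attempt resets the firewall's idle timer.

```go
package main

import (
	"fmt"
	"time"
)

// findIdleThreshold idles for start, start+step, ... seconds (up to max)
// and returns the first idle period after which the next connection
// attempt was intercepted.
func findIdleThreshold(start, step, max int,
	sleep func(seconds int), intercepted func() bool) (int, bool) {
	for idle := start; idle <= max; idle += step {
		sleep(idle)
		if intercepted() {
			return idle, true
		}
	}
	return 0, false
}

// realSleep is what a live probe would pass as the sleep function.
func realSleep(seconds int) {
	time.Sleep(time.Duration(seconds) * time.Second)
}

func main() {
	// Simulated firewall so the sketch runs instantly: it intercepts the
	// next attempt after more than 3800 idle seconds (hypothetical value).
	lastIdle := 0
	fakeSleep := func(seconds int) { lastIdle = seconds }
	fakeProbe := func() bool { return lastIdle > 3800 }

	if idle, ok := findIdleThreshold(3600, 120, 7200, fakeSleep, fakeProbe); ok {
		fmt.Printf("first intercepted attempt after %d idle seconds\n", idle)
	} else {
		fmt.Println("no interception observed")
	}
}
```

For a live run, realSleep would replace fakeSleep, and the probe would make a real connection attempt and report whether it was closed by the proxy, for example by checking for the Connection closed message.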
Include the amount of time that the PAN firewall waits after an unsuccessful proxy attempt before triggering the next attempt.