Managing such a big cluster (hundreds of vms) is not easy, and here are some of the challenges we are facing from day to day:
and one tool we heavily rely on for managing our deploy is BOSH.
BOSH provides software release which packages up all related source code, binary assets, configuration etc, In addition to releases BOSH provides a way to capture all Operating System dependencies as one image which is called Stemcell.
BOSH tool chain provides a centralized server that manages software releases, Operating System images, persistent data, and system configuration. It provides a clear and simple way of operating deployed system.
(note: If you are not familiar with BOSH, one suggestion is to go through the BOSH tutorial )
it works great for us, yet we would like to know how far we can go, we did a scaling test for BOSH. and it turns out we can scale to up to 2000 vms and we still didn't see the limit of BOSH yet!
Here is how we roll it:
Note: We put the nats job into another vm, in order to be able to scale it separately, This is not recommended, we assume we will see the CPU contention but it doesn't happened, it bring extra complexity without significant benefit, more points of failure in the control plane.
A deployment manifest snippet as below:
jobs: - name: nats templates: - name: nats release: bosh instances: 1 resource_pool: default networks: - name: default default: - dns - gateway static_ips: - 10.0.0.8 properties: nats: user: <REPLACE WITH NATS USER> password: <REPLACE WITH NATS PASSWORD> address: 10.0.0.8 port: 4222 - name: bosh templates: - name: redis release: bosh - name: powerdns release: bosh - name: director release: bosh - name: registry release: bosh - name: health_monitor release: bosh instances: 1 resource_pool: default persistent_disk: 20480 networks: - name: default default: - dns - gateway static_ips: - 10.0.0.7 - name: vip_network static_ips: - <REPLACE WITH IPs>
Create a deployment manifest for dummy BOSH release, create a dummy BOSH release and upload to the second BOSH director.
BOSH deploy, 2000 vms get created and deployed.
Make some changes to dummy release, create a new version of dummy release and do BOSH deploy again, 2000 vms get updated with the new code.
We do managed to finish the creation of 2000 vms cluster from 0, update and delete it in one day, yet we found a couple of caveats.
First issue we found is AWS doesn't offer us 2000 instances in one AZ(availability zone). we have more than 1000 instances available in one AZ and less instances available in another AZ.
This one is easily overcome by creating 2 resource pools in BOSH manifest, each resource pool utilize one AZ. Then BOSH will automatically figure out how to place the vms into 2 AZ separately.
Another difficulty we found in the deploy is it will takes a lot of time if we can only update one vm at a time. Luckily, BOSH has a max_in_flight option which can specify how many vms to update concurrently.
But we would like to push it higher see how fast we can go, we found another hard limitation in BOSH which most people don't know, which is the max_threads in BOSH director configuration.
Before we change the max_threads, even if we set max_in_flight to 256, we found only 32 vms was updating at the same time.
So we try increasing that max_threads to 256, (you need target the first BOSH and change the BOSH manifest and redeploy to make it effect), and we start seeing some error like below:
Error 100: Request limit exceeded. AWS::EC2::Errors::RequestLimitExceeded Request limit exceeded
After talking to Amazon about it, this is a general limitation in AWS EC2 and we can't easily increase it in AWS side.
So we step back a bit to find out what the biggest parameter we can use on AWS, and 50 seems to be what we can do in Max in our account.
(BTW: We also submit some feature request to BOSH so it will retry when the limit exceeded, and the story is now in the backlog
This is not an ideal speed but still something acceptable, after this tuning, we managed to deploy 2000 vms in 2 hours, update 2000 vms in next 2 hours and delete the whole deployment in next 1.5 hour.
During the whole deploy process, we observed the CPU, Network, Disk utilization in the director VM (we use c3.xlarge) are still pretty moderate, which indicate we still have headrooms, but we don't go further due to time constraints.