Cloud Foundry HA with NATS and other explanations (by James Bayer)

There has been another post on this previously. When running on vSphere with a SAN, this is generally not an issue: we have relied on vSphere HA features for several years, and it offers a robust IaaS that restarts any SPOF VMs immediately. This is how we have run CloudFoundry.com for about two years without meaningful downtime related to single points of failure. If you do not have an HA IaaS with resilient storage, then Cloud Foundry is not fully multi-site or HA with out-of-the-box configurations yet. We are working on removing all single points of failure for environments like AWS that do not have the same capabilities as vSphere.
- We have recently worked on MySQL support for the Cloud Controller DB, which means that when running on AWS, RDS could be used.
- There has also been some discussion recently about removing single points of failure in NATS on the GitHub issues: https://github.com/cloudfoundry/cf-release/issues/32
- Health Manager is currently a SPOF
- UAA 1.4.x (almost deployed) will support horizontal scaling

So we are actively working on this, but we do not have all the pieces finished. We will be updating the cloudfoundry.github.com docs as we get closer.


Health Manager (HM) only operates as a single node and is therefore not HA, but a CF system should continue to operate in a degraded mode without it. In this degraded mode, the actual state of the world (application state, instances, etc.) and the intended state of the world will drift apart until HM becomes available again.
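HM's job can be pictured as a reconciler between intended state (what the Cloud Controller DB says should be running) and actual state (what the DEAs report). A minimal sketch of the drift computation, with hypothetical names that are not CF's actual API:

```python
def drift(desired, actual):
    """Compare intended instance counts per app with the actual counts
    reported by the runtime. Returns the per-app delta HM would act on:
    positive means instances need starting, negative means stopping.
    While HM is down, nobody computes these deltas, so drift accumulates."""
    apps = set(desired) | set(actual)
    return {
        app: desired.get(app, 0) - actual.get(app, 0)
        for app in apps
        if desired.get(app, 0) != actual.get(app, 0)
    }
```

When HM comes back, it re-runs this comparison and issues start/stop commands to converge the two views again.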

NATS is still a SPOF, but we have completed a body of work to make sure CF components behave well when it is not around. Basically, the system should operate in a degraded mode until NATS returns; previously, many components did not behave well when NATS went away. In a subsequent set of work, we will consider other HA options for NATS, including running a hot/warm NATS pair and using something like VRRP to migrate the IP, clustered NATS, or other options that keep NATS available. We decided that planning to lose NATS completely was a better path for overall system health right now than trying to prevent something that could conceivably happen.

Examples of how this current set of NATS work affects particular CF components:

- All - should attempt to periodically reconnect to NATS instead of exiting or giving up.

- Cloud Controller - in the degraded mode when NATS is unavailable, Cloud Controller API requests that make writable changes to apps do not take effect: push will fail to stage, scale should take effect in the CC DB but not be communicated to the DEAs, and delete (not sure what happens here until I try it, but I'd expect it to remove the app from the CC DB and have HM garbage-collect the app later). Some read operations that need NATS, like stats, will also not work while NATS is unavailable.

- Router - when the Router cannot reach NATS, it will not expire the routes it knows about, so existing apps will continue to be routed to.

- Health Manager - when the HM cannot connect to NATS, start/kill commands should not be evaluated until the NATS connection can be restored.

- DEAs - when a DEA cannot connect to NATS, apps should not be stopped and the DEA should not exit.
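The "reconnect instead of exiting" behavior expected of all components can be sketched as a retry loop with capped exponential backoff. This is a minimal illustration, not CF's actual code; `connect` stands in for whatever a real NATS client's connect call is:

```python
import time

def connect_with_retry(connect, max_delay=30.0, sleep=time.sleep):
    """Keep retrying the message-bus connection instead of exiting.
    The component stays up in degraded mode, serving from its
    last-known state, and backs off between attempts so it does not
    hammer NATS the moment it returns."""
    delay = 1.0
    while True:
        try:
            return connect()
        except ConnectionError:
            # Degraded mode: wait, then try again with a longer delay,
            # capped so recovery is still prompt after long outages.
            sleep(delay)
            delay = min(delay * 2, max_delay)
```

The key design point from the post: on failure the component sleeps and retries rather than raising or calling exit, so a NATS outage degrades the system instead of cascading into component crashes.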