I was called at 3 AM today. Great, what did we break now!? "Well, nothing seems to be working," said the engineer who escalated the problem to me. We found that the monitoring system we run for our many customers was not working. The reason: AWS Beijing is down. Well, they will say something different; as of now, 12:06 PM, they have this explanation up. They have more explanations, but this is the core of the issue.
This is BAD stuff. If your web interface does not work, it does not matter if this is JUST one Availability Zone. For instance, the server instances could not be displayed, among other issues. On top of that, we have Multi-AZ RDS instances that are also not working (RDS is their managed DB service). If this affected only one Availability Zone, they should have failed over to the working Availability Zone. Why have they not failed over? As a result, the customers of the company I work for have been ringing the phone off the hook ever since, and I would guess the same has happened to other AWS customers.