About 8 months ago I spent a couple of months trying to deal with Azure. Recently I've been asked about Azure a few times, so I tried to put together why I feel it is not suitable for the kind of automation I used to work on.
Be warned that most things written here are my personal thoughts and already quite old. I got several pieces of feedback after this post; see the blog post comments and the reddit comments.
As of the time I was dealing with Azure, the service had several bugs that had not been fixed for years. Instead of fixing them, they released long documents to help their clients understand the bugs and work around them properly. Here is a quote from Azure's own cloud-services-allocation-failures page:
You may occasionally receive errors when performing these operations even before you reach the Azure subscription limits. This article explains the causes of some of the common allocation failures and suggests possible remediation. The information may also be useful when you plan the deployment of your services.
This one is simply telling me that trying to create a VM sometimes starts returning a non-temporary error, which effectively means you can't use the offering's main feature, creating a new VM, unless you work around a pretty well documented bug. The error looks non-deterministic unless you know about Azure internals, which are only documented in the troubleshooting document I linked above.
One of my old colleagues said this after seeing the same pattern several times: "Is this the Azure way? Write a very detailed document for a well-known bug to let users deal with it, instead of fixing the product."
Let me tell you about what I experienced differently from AWS.
Unnecessarily complex API design and flows.
As an example, just to attach a disk to a VM, one has to deal with lots of confusing terms and their Azure-specific meanings: Blob (Block Blob, Page Blob), AFS, VM disk (a special kind of blob), VHD, LUN number (why do I have to care about LUN numbers?)...
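To give a feel for how many concepts the caller had to juggle, here is a minimal sketch of what a classic attach-disk request conceptually contained. The class and field names are hypothetical, not the real SDK's; they only model the pieces of information the API expected you to supply.

```python
from dataclasses import dataclass

# Hypothetical model of everything a classic attach-disk call needed.
# Field names are illustrative, not the real Azure SDK's.
@dataclass
class DataDiskRequest:
    cloud_service: str   # the Cloud Service hosting the VM
    deployment: str      # the deployment inside that service
    role: str            # the VM ("role") itself
    lun: int             # Logical Unit Number, chosen and tracked by the caller
    vhd_url: str         # a Page Blob in a storage account holding the VHD
    size_gb: int

req = DataDiskRequest(
    cloud_service="my-service",
    deployment="my-deployment",
    role="worker-01",
    lun=0,  # the caller, not the platform, manages LUNs
    vhd_url="https://myaccount.blob.core.windows.net/vhds/data1.vhd",
    size_gb=100,
)
```

Compare this with attaching an EBS volume on AWS, where you name a volume, an instance, and a device, and the platform does not ask you to understand its internal blob taxonomy first.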
Or let's take creating a VM: you first have to understand what Cloud Services and deployments are before you can create VMs. Even if there are some CLI tools that can create these for you, as soon as you start implementing some automation you immediately find yourself in a position where you have to understand all of it. I also remember my first time dealing with AWS; it felt massive, but I didn't feel that desperate when trying to understand any AWS service for the first time. I was really frustrated when I tried to understand Azure terminology for the first time, mainly because of the duplicate/redundant API versions and portals I mention below.
The API was not stable enough (it was already a 5+ year old API, dating from 2010). For example, when I deleted a VM, even after the delete event was reported as completed (which itself took a couple of minutes to be reported as success or failure), I couldn't delete the related disks for an extra 10 minutes. This was not an isolated case; we hit similar problems repeatedly. These cases were not documented clearly, and after hitting several issues like this, our infrastructure automation started to be polluted by unnecessary but unavoidable checks and sleep-and-retry logic in many places. This also made debugging and troubleshooting harder in a fast-paced infra environment, especially with many VMs being created and destroyed on the fly for different batch jobs.
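The sleep-and-retry checks looked roughly like this sketch. The `delete_disk` callable here is a stand-in for whatever SDK call actually deletes the disk; the shape of the loop, keep retrying until the platform finally releases the disk or a deadline passes, is the point.

```python
import time

def delete_disk_with_retry(delete_disk, disk_name,
                           timeout=15 * 60, interval=30):
    """Keep retrying a disk delete until the platform releases it.

    `delete_disk` is a caller-supplied callable (hypothetical here)
    that raises while the disk is still held by the deleted VM.
    """
    deadline = time.time() + timeout
    while True:
        try:
            delete_disk(disk_name)
            return True
        except Exception:
            if time.time() >= deadline:
                raise
            # the disk could stay "in use" for ~10 minutes after the
            # VM delete was already reported as completed
            time.sleep(interval)
```

Multiply this pattern across every resource pair with an undocumented ordering dependency and you get the pollution described above.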
Non-deterministic API return codes. Some Azure API calls randomly return 500, which is also the case for most AWS services, so that by itself is not a big deal. But AWS explicitly documents these cases and covers them, along with the necessary retry, back-off and throttle-handling mechanisms, in the provided SDKs/libraries, which is also why they always suggest using the provided SDKs instead of implementing API clients yourself. I didn't see a similar mechanism applied in many places in Azure's Python SDK (the only one we were able to find for our automation). Since we expected these problems to be addressed in the provided libraries, we implemented such logic only here and there, and at some point found ourselves with a debt to be paid in our own source code: our infrastructure automation ended up as spaghetti, full of back-off and retry logic everywhere. This could also be our mistake for not studying the provided libraries.
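For reference, the kind of wrapper that ended up copy-pasted all over our code, and that a good SDK would provide out of the box, is a simple exponential back-off with jitter. This is a generic sketch, not Azure SDK code:

```python
import random
import time

def with_backoff(fn, max_attempts=5, base=1.0, cap=30.0):
    """Wrap `fn` so transient failures (e.g. random 500s) are retried
    with exponential back-off and jitter."""
    def wrapper(*args, **kwargs):
        for attempt in range(max_attempts):
            try:
                return fn(*args, **kwargs)
            except Exception:
                if attempt == max_attempts - 1:
                    raise  # out of attempts, surface the error
                delay = min(cap, base * 2 ** attempt)
                # jitter avoids synchronized retry storms
                time.sleep(delay + random.uniform(0, delay / 2))
    return wrapper
```

The AWS SDKs ship roughly this behavior built in (with service-aware defaults); having to hand-roll it per call site is where the spaghetti came from.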
I feel like the Azure API was not designed with overall architectural needs in mind. We were trying to mount external block volumes and were not able to find a proper storage driver like flocker or REX-Ray. Both were/are running on AWS without issues, but we couldn't find any upstream tool providing similar functionality for Azure, which made me think I was doing something wrong. I spent some time trying to find one, and then trying to understand whether we could implement it ourselves, for our use only, as a quick solution. It turned out to be a buggy aspect of the Azure platform: you have to explicitly manage the order of mounted/unmounted devices. The API asks for a LUN number, which effectively forces one to unmount the devices in the same order they were mounted. There were some other issues in the API flow preventing this mount-the-related-blockdevice-to-a-vm-when-needed pattern from being implemented, which I can't remember off the top of my head. I don't know if this is still the case, but this was around 8 months ago.
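Because the API made the caller pick the LUN, any automation had to carry its own bookkeeping of which LUN each disk occupies on each VM. A minimal sketch of that bookkeeping (names are mine, not Azure's):

```python
class LunTable:
    """Track which LUN each attached disk occupies on one VM, since the
    classic API made the caller choose and remember LUNs himself."""

    def __init__(self, max_luns=16):
        self.max_luns = max_luns
        self.by_disk = {}  # disk name -> assigned LUN

    def attach(self, disk):
        """Assign the lowest free LUN to `disk` and return it."""
        used = set(self.by_disk.values())
        for lun in range(self.max_luns):
            if lun not in used:
                self.by_disk[disk] = lun
                return lun
        raise RuntimeError("no free LUN on this VM")

    def detach(self, disk):
        """Free the LUN held by `disk`."""
        return self.by_disk.pop(disk)
```

A dynamic mount-on-demand driver like flocker wants to attach and detach volumes in arbitrary order; state like this has to be kept consistent with what the platform believes, which is exactly the kind of burden the cloud API should have absorbed.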
Another one is that you can't start several machines in the same deployment at once. This may be a bug in the client library we used (which was also provided by Azure), but we were forced to start nodes one by one in our automation. This can be worked around, but one has to design the whole automation layer with this Azure-specific, not-so-clearly-documented integration limit in mind. It may be my mistake to start implementing automation without fully understanding the underlying systems, but finding this kind of limit in the middle of an implementation is not fun.
Another example problem: some nodes suddenly went missing(!) and never came back. The Azure support team explained that this "could" happen if we create more than 40 volumes in the same volume set (blob). But each node's root disk is also a blob?! This effectively forced us to manually create and manage many blobs, each holding a fixed number of nodes' disks.
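The resulting workaround was manual sharding: spread disks across grouping units so no unit ever exceeds the cap. A tiny sketch of the mapping we effectively had to maintain ourselves; the 40-disk cap comes from what support told us, and the naming scheme is hypothetical:

```python
def storage_group_for(disk_index, disks_per_group=40, prefix="vhdgroup"):
    """Map a disk to a named grouping unit so that no unit holds more
    than `disks_per_group` volumes (the limit Azure support cited).
    The prefix/naming convention here is made up for illustration."""
    return f"{prefix}{disk_index // disks_per_group:03d}"
```

Capacity planning like this belongs in the platform, not in every customer's automation layer.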
Hard-to-understand API(s). There are 2 API versions, Azure Classic vs. Resource Manager, and also 2 different portal web interfaces, the Classic and the New Azure portal. Neither portal is feature-complete for both API versions; some features found on one portal cannot be found on the other, forcing users to use both. This also makes examples, related blog posts and documentation too hard to understand: for each resource you find, you have to figure out which version of the API the writer is using. This is usually time-consuming guesswork, since it is not obvious; you have to scan the document for clues to identify which version it is written for.
Microsoft-only tools! Most functionality is implemented only in the PowerShell CLI (I guess using some .NET-specific technologies). We also found Python and Node.js clients, but neither covered much of the API functionality we needed. This lack is not only in the CLI implementations; the same gaps exist in the corresponding libraries. You will "have to" use PowerShell to automate advanced stuff, unless you are willing to add new functionality to the provided libraries and their CLI tools and start maintaining your own fork. I feel like for most things not covered in the marketing materials or getting-started tutorials, you will likely be forced to use Microsoft-specific tools to automate.
Another example problem I experienced was at the network layer. We started to migrate our infrastructure by setting up a site-to-site VPN connection between Azure and AWS. We then realized that we couldn't use custom routing to point at a node, which blocked us from using solutions like openswan/strongswan on a local node in a subnet; we were limited to the integrations Azure provides. The documentation about the VPN integrations was not clear, but we decided to try our chances. It turned out that after creating some network definitions, we had to download an XML file, modify it, and upload it back to do what we needed. And this cannot be automated from anything but PowerShell. The functionality was also missing from both portals, so we couldn't do it manually either, and missing from the Azure-provided Python SDK and the Node.js azure CLI tools. We thought we could add the functionality to the Python libraries we used, but ended up not doing so because we could not find sufficient documentation for the corresponding API calls. A simple one-week task turned into a months-long, hard-to-maintain solution that gave us headaches with each modification.
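To show the shape of that download, edit, re-upload round trip, here is a generic sketch of the XML-editing step in Python. To be clear, the element and attribute names below are invented for illustration; the real network configuration schema is Azure's own, and the download/upload steps were the part we could only drive from PowerShell.

```python
import xml.etree.ElementTree as ET

def add_local_network_site(config_xml, site_name, address_prefix):
    """Insert a local-network-site entry into a downloaded network
    configuration document. Schema names here are hypothetical."""
    root = ET.fromstring(config_xml)
    sites = root.find("LocalNetworkSites")
    if sites is None:
        sites = ET.SubElement(root, "LocalNetworkSites")
    site = ET.SubElement(sites, "LocalNetworkSite", name=site_name)
    ET.SubElement(site, "AddressPrefix").text = address_prefix
    # the edited document then had to be uploaded back (PowerShell-only)
    return ET.tostring(root, encoding="unicode")
```

Scripting XML surgery like this is workable once, but as the opening and closing steps of the flow were PowerShell-only, the whole pipeline could never live in our Python automation.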