Now that I have my own bare-metal cluster at home, I will try to build a sample system just for fun, with these overall goals:
* minimal dependency on any existing *aaS, with each piece of the system self-contained,
* prevent any kind of vendor-lock-in,
* make every part of the system HA (DC-wide only),
* zero-touch ops: allow the system to detect and auto-heal problems,
* ready-to-use infra services to cover the most common scenarios, like ELK, CI/CD, and monitoring/alarming,
* (scope is intra-dc, so there will be no inter-dc solutions)
To achieve these goals, I'm thinking about using the following components for my own reference implementation. Before this, I played with Mesos/Docker on CoreOS/Ubuntu on some cloud providers, and also tried DCOS a bit. I ended up with a Mesos installation on Ubuntu, just because I still feel more comfortable on Ubuntu while playing around.
Cluster Level Service Management
This was already my starting point for the whole idea. Mesos + Marathon + Chronos will likely do the job.
Thanks to the guys from Mesosphere, we get some well-defined service discovery solutions without a lot of effort. We have mesos-dns (which also gives us SRV records) and haproxy-marathon-bridge (which will likely be replaced by servicerouter.py) nearly for free. Consul could also help, but it needs some not-ready-yet dependencies like an etcd cluster setup.
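To make the SRV part concrete, here is a minimal sketch of how a client could use those records. It assumes the mesos-dns naming scheme `_<task>._<protocol>.<framework>.<domain>` with the default `mesos` domain, and it leaves the actual DNS lookup out (that needs a resolver library); the record tuples and the helper names `srv_name` and `pick_endpoint` are made up for illustration.

```python
import random


def srv_name(task, framework="marathon", protocol="tcp", domain="mesos"):
    """Build the SRV query name for a task, e.g. _web._tcp.marathon.mesos."""
    return "_{0}._{1}.{2}.{3}".format(task, protocol, framework, domain)


def pick_endpoint(records, rng=random):
    """Pick one (host, port) pair from SRV-style records.

    Each record is a (priority, weight, port, target) tuple, as a resolver
    would return it. The lowest priority value wins; inside that group the
    choice is random, weighted by the weight field.
    """
    if not records:
        raise ValueError("no SRV records to choose from")
    best = min(r[0] for r in records)
    candidates = [r for r in records if r[0] == best]
    # Simplification: a weight of 0 is treated as 1 so it can still be chosen.
    weights = [max(r[1], 1) for r in candidates]
    _, _, port, target = rng.choices(candidates, weights=weights, k=1)[0]
    return target.rstrip("."), port


if __name__ == "__main__":
    # Hypothetical answer for a Marathon app called "web".
    answer = [(0, 0, 31000, "web-abc.marathon.mesos.")]
    print(srv_name("web"), pick_endpoint(answer))
```

The nice thing about SRV records is that they carry the port as well as the host, which matters here because Marathon usually assigns host ports dynamically.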
Creating an OpenVPN service seems fairly simple with marathon+docker, thanks to Kyle Manna.
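As a rough idea of what that could look like, here is a sketch that builds a Marathon app definition for an OpenVPN container. The image name `kylemanna/openvpn`, the host path `/srv/openvpn`, the resource numbers, and the fixed host port are all my assumptions, not a tested setup: the config and keys are assumed to exist already at that path on whichever node runs the task, and host port 1194 has to be part of the resources that node offers.

```python
import json


def openvpn_app(config_dir="/srv/openvpn", instances=1):
    """Return a Marathon app definition (as a dict) for an OpenVPN server."""
    return {
        "id": "/openvpn",
        "instances": instances,
        "cpus": 0.25,   # assumed sizing, adjust to taste
        "mem": 128,
        "container": {
            "type": "DOCKER",
            "docker": {
                "image": "kylemanna/openvpn",  # assumed image name
                "network": "BRIDGE",
                "portMappings": [
                    {"containerPort": 1194, "hostPort": 1194, "protocol": "udp"}
                ],
                # OpenVPN needs to create a tun device inside the container.
                "parameters": [{"key": "cap-add", "value": "NET_ADMIN"}],
            },
            "volumes": [
                # Hypothetical host directory holding the server config and keys.
                {"containerPath": "/etc/openvpn", "hostPath": config_dir, "mode": "RW"}
            ],
        },
    }


if __name__ == "__main__":
    print(json.dumps(openvpn_app(), indent=2))
```

The printed JSON is what would be POSTed to Marathon's `/v2/apps` endpoint. Note that the host volume is exactly the kind of persistence dependency I am trying to avoid elsewhere, so this part would need more thought.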
I will likely run some kind of self-hosted git repository so as not to depend on GitHub, most likely GitLab. But GitLab depends on many other components, each of which would have to be carefully architected for HA with auto-heal in mind. I have not looked into this yet.
Using Docker with a self-hosted docker-registry seems like it could help a lot with deployment problems. For building and testing there is a Jenkins integration with Mesos, but we need to overcome a set of problems around cleaning up Docker leftovers. I will try to write about this later, but given the current state of those Docker problems, running Docker builds on random nodes may not be suitable.
We need some kind of central log system, such as the ELK stack, to make identifying production problems easier. There seems to be an infinite number of log-related tools to use with Docker. But for me this runs into the lack of persistent storage in current Mesos releases; the Elasticsearch framework could help, but I'm just not there yet.
Monitoring in general is an already solved problem, but there are no best practices around how and what to monitor for Mesos/Marathon/Chronos itself. Since most parts of the system will be HA with auto-heal support, hopefully no extensive alarming/notification system will be needed for most of the components. Some monitoring/alarming tools, like Prometheus (since it also has mesos-exporter) and Satellite, seem promising, but neither seems mature enough yet. Most Mesos users seem to have their own customized systems for their monitoring infrastructure.
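Just to illustrate the "what to monitor" question, here is a small sketch of a check over the JSON that each Mesos master exposes on `/metrics/snapshot`. The metric names `master/elected` and `master/slaves_active` are the ones I remember and should be verified against the Mesos version in use; the function name, the threshold, and the sample data are made up, and fetching the snapshots over HTTP is left out.

```python
def check_masters(snapshots, min_active_agents=1):
    """Return a list of problem descriptions (empty means all good).

    snapshots maps a master name to its parsed /metrics/snapshot JSON.
    """
    problems = []
    leaders = [name for name, metrics in snapshots.items()
               if metrics.get("master/elected") == 1]
    if len(leaders) != 1:
        problems.append(
            "expected exactly 1 elected master, found %d" % len(leaders))
        return problems
    # Only the leading master has a meaningful view of the agents.
    active = snapshots[leaders[0]].get("master/slaves_active", 0)
    if active < min_active_agents:
        problems.append(
            "only %d active agents, wanted at least %d"
            % (active, min_active_agents))
    return problems


if __name__ == "__main__":
    sample = {
        "master1": {"master/elected": 1.0, "master/slaves_active": 3.0},
        "master2": {"master/elected": 0.0},
        "master3": {"master/elected": 0.0},
    }
    print(check_masters(sample, min_active_agents=3))
```

With HA masters, a single non-leading master is normal, so the check looks at all masters together instead of alarming on each one separately.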
For now I will leave any persistent-storage-dependent components to a later phase (Mesos will likely have persistent-storage support soon). As a first step, I will focus on these components:
* marathon - service scheduler for ephemeral and/or idempotent tasks
* chronos - cluster-wide cron jobs
* haproxy-marathon-bridge - for HA and service discovery
* mesos-dns - service discovery using DNS, also generates SRV records
* openvpn - so that I can reach my cluster from everywhere
* docker-registry - my own docker registry, I'll use S3 to store images for now to avoid persistent-storage needs
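For the registry item in the list above, here is a sketch of how the S3 backend could be wired in through environment variables, so the container itself stays stateless. It assumes the `registry:2` image and its `REGISTRY_STORAGE_*` environment overrides; the bucket, region, credentials, and resource numbers are placeholders, and an older registry version would use different settings.

```python
import json


def registry_app(bucket, region, access_key, secret_key, instances=2):
    """Return a Marathon app definition for a registry storing images in S3."""
    env = {
        "REGISTRY_STORAGE": "s3",
        "REGISTRY_STORAGE_S3_BUCKET": bucket,
        "REGISTRY_STORAGE_S3_REGION": region,
        "REGISTRY_STORAGE_S3_ACCESSKEY": access_key,
        "REGISTRY_STORAGE_S3_SECRETKEY": secret_key,
    }
    return {
        "id": "/docker-registry",
        # No local state, so more than one instance can run side by side.
        "instances": instances,
        "cpus": 0.5,   # assumed sizing
        "mem": 256,
        "env": env,
        "container": {
            "type": "DOCKER",
            "docker": {
                "image": "registry:2",
                "network": "BRIDGE",
                # hostPort 0 lets Marathon pick a free port on the node.
                "portMappings": [{"containerPort": 5000, "hostPort": 0}],
            },
        },
        # Marathon restarts the task if the registry API stops answering.
        "healthChecks": [
            {"protocol": "HTTP", "path": "/v2/", "portIndex": 0}
        ],
    }


if __name__ == "__main__":
    app = registry_app("my-bucket", "eu-west-1", "ACCESS", "SECRET")
    print(json.dumps(app, indent=2))
```

Because the host port is dynamic, the registry would be reached through haproxy-marathon-bridge or an SRV lookup from the same list, not through a fixed address.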
For now I'll be missing these in my setup:
* any kind of persistence, a big gap in the whole stack; there are no best practices yet for running services that need some kind of persistence
* disaster recovery on inter-dc/inter-region level
* DCOS - which has its own package support for easily installing frameworks, and seems to help mostly with installing persistence-related frameworks; I'm not able to use DCOS since it seems they only support AWS for now
* any kind of security between different components; I will make the mistake of assuming the local network is safe