Hyper Open Edge Cloud

SlapOS Design Document - Understanding SlapOS Promises

FINAL - A design document introducing Promises and how they are used in SlapOS.
  • Last Update: 2021-01-11
  • Version: 001
  • Language: en

Understanding SlapOS Promises

SlapOS (introduction) is a general purpose overlay operating system for distributed POSIX infrastructures. It is based on a Master and Slave design in which the Master assigns services to Slave nodes. Slave nodes in turn process their list of services using buildout and send connection and consumption information, as well as their monitoring status, back to the Master. The monitoring status of each service is based on Promises, which are explained in detail in this document.

Table of Contents

  • What is a Promise?
  • Adding a Promise to a Software Release
  • Monitoring Promises
  • Watchdog

What is a Promise?

This section will briefly introduce Promises and how they are used in SlapOS to monitor whether an instance is accessible or not.

Example Promise


import socket
from slapos.grid.promise import interface
from slapos.grid.promise.generic import GenericPromise
from zope.interface import implementer

@implementer(interface.IPromise)
class RunPromise(GenericPromise):
  def __init__(self, config):
    super(RunPromise, self).__init__(config)

  def sense(self):
    """
      Simply test if we can connect to specified host:port.
    """
    hostname = self.getConfig('hostname')
    port = int(self.getConfig('port'))
    addr = (hostname , port)
    try:
      socket.create_connection(addr).close()
    except (socket.herror, socket.gaierror) as e:
      self.logger.error("ERROR hostname/port (%s) is not correct: %s", addr, e)
    except (socket.error, socket.timeout) as e:
      self.logger.error("ERROR while connecting to %s: %s", addr, e)
    else:
      self.logger.info("port connection OK (%s)", addr)
  
  def anomaly(self):
    """
      There is an anomaly if last 3 senses were bad.
    """
    return self._anomaly(result_count=3, failure_amount=3)

Port Listening Promise

A Promise is a Python script that does some arbitrary work and then returns a promise result saying whether the promise succeeded or failed. A promise script can define configuration parameters which are used to check the state. Promises are generated during instantiation in $instance_home/etc/plugin, and Slapgrid then runs each of them according to its configuration to know whether an instance is working or not.

The simplest example of a promise is "check_if_port_listening", which tries to open a socket to an ip/port. If this works and no other promise is failing, the instance will be green. If the socket can't be created, slapgrid raises PromiseError and reports it to the SlapOS Master, and the instance becomes red.

The promise system should be used in all SlapOS software releases and stacks to define as precisely as possible whether an instance is working or not.

Promise Parts

  • Promise sensor
  • Promise test
  • Promise anomaly detector

We want to promote a simple, easy and standardised way of writing promise scripts that will verify the state of the system. These scripts can be launched by slapgrid and are configurable for each Software Release. Every promise has three parts:

The promise sensor collects the value of some monitoring aspects such as "if server is supposed to be started, get the response of an http request, else return 'server stopped' and in case of timeout return empty string".

The promise test is Green if the result of the promise sensor of the previous example is not empty, else Red. This ensures that a server that is started actually responds to http requests. There is no margin of tolerance for promise tests.

The promise anomaly detector is Green if at least one of the last three promise sensor values was not empty, else it is Red. This ensures that we call bang only if the server is really stopped, not if an Internet glitch happened.
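
The difference between the promise test and the promise anomaly detector can be sketched in plain Python. This is only an illustration of the rules described above, not the SlapOS implementation: the function names, the list of sensor values and the convention that an empty string means "no response" are assumptions taken from the example.

```python
def promise_test(sensor_values):
    """Green only if the most recent sensor value is not empty (no tolerance)."""
    return "Green" if sensor_values and sensor_values[-1] != "" else "Red"


def promise_anomaly(sensor_values, window=3):
    """Green if at least one of the last `window` sensor values is not empty."""
    recent = sensor_values[-window:]
    return "Green" if any(value != "" for value in recent) else "Red"


# Hypothetical sensor histories, oldest value first.
healthy = ["200 OK", "", "200 OK"]   # one timeout in the past
glitch = ["200 OK", "200 OK", ""]    # a single timeout just now
outage = ["", "", ""]                # three timeouts in a row
```

With these histories, a single timeout makes the test Red while the anomaly detector stays Green, so bang would only be called for the three timeouts in a row.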

Note: Promises are what Buildout launches at the end. They return True or False. True means that one aspect of the partition is OK. 

Adding a Promise to a Software Release

The following section will show how to add a Promise to a software release. This can either be an existing Promise from the SlapOS repository recipe folder or a new Promise written from scratch.

Adding Existing Promises to a Software Release

The recipe slapos.cookbook:promise.plugin can be used to generate promise scripts.

Using any of the existing promises requires adding a new section to the software release profile (and don't forget to add it to the parts list, too). For example:

[promise-check-site]
recipe = slapos.cookbook:promise.plugin
eggs =
  slapos.toolbox
output = ${directory:plugins}/promise-check-mysite-status.py
# module is the promise file name (without .py) in slapos.toolbox
module = check_site_state
config-site-url = ${publish:site-url}
config-connection-timeout = 20
config-foo = bar

This will generate a script which tests whether ${publish:site-url} is available, with a timeout of 20 seconds after which the promise fails. Passing config-foo = bar is an example of how additional parameters are passed to the promise.

Add New Promise to a Software Release

from slapos.grid.promise import interface
from slapos.grid.promise.generic import GenericPromise, TestResult, AnomalyResult
from zope.interface import implementer

@implementer(interface.IPromise)
class RunPromise(GenericPromise):
  def __init__(self, config):
    super(RunPromise, self).__init__(config)
    # run the promise every 2 minutes
    self.setPeriodicity(minute=2)

  def anomaly(self):
    """
      Called to detect if there is an anomaly.
      Return AnomalyResult or TestResult object
      # When AnomalyResult has failure bang is called if another promise didn't bang
    """

    # Example
    promise_result_list = self.getLastPromiseResultList(result_count=3, only_failure=True)
    if len(promise_result_list) > 2:
      return AnomalyResult(problem=True, message=promise_result_list[0][0]['message'])
    return AnomalyResult(problem=False, message="")

    # It's possible to use Generic helper methods
    # return self._anomaly(result_count=3, failure_amount=3)

  def sense(self):
    """
      Run the promise code and store the result in promise log file
        raise error, log error message, ... for failure
    """

    # DO SOMETHING...
    failed = True
    raised = False
    if failed:
      self.logger.error("ERROR while checking instance http server")
    else:
      self.logger.info("http server is OK")
    if raised:
      raise ValueError("Server URL is not correct")

  def test(self):
    """
      Test promise and say if problem is detected or not
      Return TestResult object
    """

    # Example
    promise_result_list = self.getLastPromiseResultList(result_count=1)[0]
    problem = False
    message = ""
    for result in promise_result_list:
      if result['status'] == 'ERROR' and not problem:
        problem = True
      message += "\n%s" % result['message']

    return TestResult(problem=problem, message=message)

    # It's possible to use Generic helper methods
    # return self._test(result_count=1, failure_amount=1)

This script is an example of a Promise in python. Writing a Promise consists of defining a class called RunPromise:

class RunPromise(GenericPromise):

which inherits from the GenericPromise class, and then defining the methods anomaly(), sense() and test() inside this class.

Python promises should be placed into the folder etc/plugin of the computer partition.

sense() runs the promise code with the given parameters, collects data for the promise whenever it makes sense and appends it to a log file.

test() reads the promise log and returns a TestResult object describing the actual promise state. The test method is called when Buildout processes a partition. A partition is marked as correctly processed if there are no Buildout failures and the test() of every promise passes.

anomaly() returns an AnomalyResult object which describes the promise state. The anomaly method is called by SlapGrid once the partition is correctly processed, to check that the partition has no anomaly. If AnomalyResult.hasFailed() is True, bang is called unless another promise of the same instance has already called bang.

GenericPromise

...
  @abstractmethod
  def sense(self):
    """Run the promise code and log the result"""

  def anomaly(self):
    """Called to detect if there is an anomaly which require to bang."""
    return self._anomaly()

  def test(self):
    """Test promise and say if problem is detected or not"""
    return self._test()

  def run(self, check_anomaly=False, can_bang=True):
    """
      Method called to run the Promise
      @param check_anomaly: Say if anomaly method should be called
      @param can_bang: Set to True if bang can be called, this parameter should
        be set to False if bang is already called by another promise.
    """
    ...

The GenericPromise class contains the base implementation of a Promise and provides a method run() which reads the option 'check_anomaly' to force a call of anomaly() instead of test(). By default, running a promise script calls sense() to produce a result and test() to check the result. The option check_anomaly is used by buildout for the periodic promise check, when the partition is already correctly deployed.

In the future, GenericPromise will be improved to provide more methods that can be used in sense() to store promise graph data. This graph data will be used by the monitor interface to plot a chart of promise result progression.
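
The control flow of run() described above can be modelled roughly as follows. The class is a hypothetical stand-in written for this document, not the real GenericPromise: the boolean return values, the bang() method and the recorded call list are simplifications used only to make the order of calls visible.

```python
class SketchPromise:
    """Hypothetical stand-in for a promise; not the real GenericPromise."""

    def __init__(self):
        self.calls = []  # records the order in which methods are called

    def sense(self):
        self.calls.append("sense")

    def test(self):
        self.calls.append("test")
        return False  # False: no problem detected

    def anomaly(self):
        self.calls.append("anomaly")
        return True  # True: an anomaly was detected

    def bang(self):
        self.calls.append("bang")

    def run(self, check_anomaly=False, can_bang=True):
        # sense() always runs first to produce a fresh result.
        self.sense()
        if not check_anomaly:
            return self.test()
        problem = self.anomaly()
        # Bang at most once per instance: skip it if another promise
        # of the same instance has already called bang.
        if problem and can_bang:
            self.bang()
        return problem


deploy_check = SketchPromise()
deploy_check.run()

periodic_check = SketchPromise()
periodic_check.run(check_anomaly=True)

already_banged = SketchPromise()
already_banged.run(check_anomaly=True, can_bang=False)
```

In this model, the deployment run calls sense() and test(), the periodic run calls sense(), anomaly() and bang(), and the last run skips bang() because can_bang is False.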

Methods Available in Promise Class

...
self.getConfig(key, default=None)
self.getLastPromiseResultList(latest_minute=0, result_count=COUNT, only_failure=False)
self._test(result_count=COUNT, failure_amount=XX, latest_minute=0)
self._anomaly(result_count=COUNT, failure_amount=XX, latest_minute=0)
...

Promises inherit the following methods from GenericPromise:

  • self.getTitle() - returns Promise title, eg. my_promise
  • self.getName() - returns Promise (file) name, eg. my_promise.py
  • self.getPromiseFile() - returns Promise file path
  • self.getPeriodicity() - returns current Promise periodicity
  • self.getLogFile() - returns path to the log file
  • self.getLogFolder() - returns path to the monitoring logs folder
  • self.getPartitionFolder() - returns the base partition folder
  • self.getConfig(key, default=None) - returns configuration sent to the Promise class
    Default configuration keys available are: partition-id, computer-id, partition-key, partition-cert, master-url and slapgrid-version.
  • self.setConfig(key, value) - registers a new configuration
  • self.getLastPromiseResultList(latest_minute=0, result_count=COUNT, only_failure=False) - reads the result groups of the latest COUNT promise executions from the promise log. Set latest_minute to limit how far back in time to search. If only_failure is True, only failure messages are returned.
  • self._test(result_count=COUNT, failure_amount=XX, latest_minute=0) - returns TestResult from the latest Promise results
  • self._anomaly(result_count=COUNT, failure_amount=XX, latest_minute=0) - returns AnomalyResult from the latest Promise results

The following inherited methods should be called in the promise's __init__(), after the line "GenericPromise.__init__(self, config)":

  • self.setPeriodicity(minute=XX) - changes the default periodicity of the promise anomaly check
  • self.setTestLess() - disables the promise test call; the promise will be called only to check anomaly
  • self.setAnomalyLess() - disables the promise anomaly call; the promise will be called only to check test (when buildout is deploying the partition).

Note: if both Anomaly and Test are disabled, the promise will raise an error, because it would have nothing to check.

In your promise code, you will be able to call self.getConfig("site-url"), self.getConfig("connection-timeout") and self.getConfig("foo"). The returned value of self.getConfig(KEY) is None if the config parameter KEY is not set.
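
The mapping between the config-* options of the profile section and the keys seen by getConfig() can be illustrated with a small stand-alone sketch. The helper promise_config_from_section() is hypothetical and is not part of slapos.cookbook; a plain dict stands in for the promise configuration, and the URL is made up for the example.

```python
def promise_config_from_section(section):
    """Collect the `config-*` options of a section, without the prefix."""
    prefix = "config-"
    return {
        key[len(prefix):]: value
        for key, value in section.items()
        if key.startswith(prefix)
    }


# Options of the [promise-check-site] section shown earlier, with the
# ${publish:site-url} reference resolved by hand to a made-up URL.
section = {
    "recipe": "slapos.cookbook:promise.plugin",
    "module": "check_site_state",
    "config-site-url": "https://example.com/",
    "config-connection-timeout": "20",
    "config-foo": "bar",
}
config = promise_config_from_section(section)

# Buildout options are strings, so numeric values need an explicit conversion.
timeout = int(config.get("connection-timeout", 10))
# A key that was never set yields None, like self.getConfig(KEY).
missing = config.get("not-set")
```

The explicit int() conversion mirrors the int(self.getConfig('port')) call in the port listening promise.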

Developing Python Promises

from slapos.promise.plugin.check_site_state import RunPromise

Promise code must be committed to the slapos.toolbox repository. Please put your promises into the folder slapos/promise/plugin, so that they can be imported from a file in the etc/plugin folder of your instance.

For debugging, the promise runner script added by the monitor can be used to test promise execution without using slapgrid. The script is exposed in the bin/ directory of the software release.

You can run a promise, using:

SR_DIRECTORY/bin/monitor.runpromise --config etc/monitor.conf --console --dry-run [ARG, ...]

Note that legacy promises are promises placed in PARTITION_DIRECTORY/etc/promise. They can be bash or other executable scripts. The promise launcher uses a special wrapper to call them as a subprocess, and the success or failure state is based on the process return code (0 = success, > 0 = failure).
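
How an executable legacy promise can be judged by its return code is sketched below with the standard subprocess module. This is not the wrapper used by the promise launcher: the function name run_legacy_promise() and the 20 second timeout are assumptions, and the two "promises" are throwaway commands built from the current Python interpreter.

```python
import subprocess
import sys


def run_legacy_promise(command, timeout=20):
    """Run an executable promise and map its return code to (ok, output)."""
    try:
        completed = subprocess.run(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # A promise that never answers is treated as failed.
        return False, "promise timed out after %s seconds" % timeout
    output = completed.stdout.decode("utf-8", errors="replace").strip()
    # Return code 0 means success, anything else means failure.
    return completed.returncode == 0, output


# Two throwaway "promises": one exits with 0, the other with 2.
passing = [sys.executable, "-c", "print('ok')"]
failing = [sys.executable, "-c", "import sys; print('down'); sys.exit(2)"]

passing_result = run_legacy_promise(passing)
failing_result = run_legacy_promise(failing)
```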

To set the frequency of buildout runs, the software release should write a file named periodicity into the software release folder, containing the time period in seconds. For example, to process the partition every 12 hours, the file /opt/slapgrid/SR_MD5SUM/periodicity should contain 43200 (12 h = 43200 s).
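
As a worked example of the conversion, the sketch below writes a periodicity file for a 12 hour period. The helper write_periodicity() is hypothetical, and a temporary directory stands in for /opt/slapgrid/SR_MD5SUM; in a real software release the file would be produced by the profile itself.

```python
import os
import tempfile


def write_periodicity(software_release_dir, hours):
    """Write the buildout run period, in seconds, to <dir>/periodicity."""
    seconds = int(hours * 3600)
    path = os.path.join(software_release_dir, "periodicity")
    with open(path, "w") as periodicity_file:
        periodicity_file.write(str(seconds))
    return seconds


# A temporary directory stands in for the software release folder.
with tempfile.TemporaryDirectory() as fake_sr_dir:
    written = write_periodicity(fake_sr_dir, 12)
    with open(os.path.join(fake_sr_dir, "periodicity")) as periodicity_file:
        content = periodicity_file.read()
```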

Monitoring Promises

This section covers monitoring of partitions along with goals of running Promises correctly as well as things to avoid.

Controlling Partition Status

  • Periodic Instantiation
  • Periodic Promise sensors
  • Bang

In normal conditions:

  • Instantiation runs periodically (at least once in an interval of computer configurable frequency which is usually 24 hours), running promises and posting to master, hence showing signs of life.
  • Slapgrid periodically runs a set of promise sensors, and when an anomaly is detected on a promise sensor value, bang is called on the partition.
  • Upon call of bang, a run of partition instantiation is scheduled by SlapOS Master on all partitions that belong to the same software instance tree.

Running buildout on all partitions after a bang is supposed to converge to a stable state with all promises passing.

Slapgrid is configured to run promises at an interval of time which can be configured differently for each promise sensor (see above). SlapOS knows nothing about the results of running promise sensors. The only thing the Master knows is that a bang was issued due to anomaly detection.

Monitoring Goals

  • Servers are alive
  • Partitions are fulfilling all promises

The goal of monitoring is to provide a good quality of service by knowing about problems before customers report them. This is done by ensuring that servers are alive and partitions are fulfilling all promises.

Alive servers

Servers should contact the master periodically to notify that they are alive. The master shows the state of each server as a colour. A server is Green if it contacted the master within the last 5 minutes. If it contacted the master within the last hour, the server is Orange, else it is Red. From a monitoring point of view, the server contacts the master whenever Slapgrid connects to the SlapOS master, no matter what for.
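
The colour thresholds above translate into a short classification function. This is a sketch of the rule only, not code from SlapOS Master: the function name and the idea of passing the current time explicitly are assumptions made to keep the example deterministic.

```python
from datetime import datetime, timedelta


def server_colour(last_contact, now):
    """Classify a server from the time of its last contact with the Master."""
    elapsed = now - last_contact
    if elapsed <= timedelta(minutes=5):
        return "Green"
    if elapsed <= timedelta(hours=1):
        return "Orange"
    return "Red"


now = datetime(2021, 1, 11, 12, 0, 0)
recent = server_colour(now - timedelta(minutes=2), now)
late = server_colour(now - timedelta(minutes=30), now)
silent = server_colour(now - timedelta(hours=2), now)
```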

 

Fulfilled promises

The master shows the state of each requested partition as a colour. A partition is Green if all of the following hold: the latest result sent by Slapgrid for that partition is OK (meaning that all promises succeeded and there were no other failures), that result was sent less than one day ago and less than one buildout run period (as defined by the software release) ago, and no bang was triggered after it. Else the partition is Red.

Note 1: Buildout on a partition in SlapOS will be executed at least once per computer configurable frequency (usually one day) and at least once per software release configurable frequency (seldom configured).

Note 2: the computer configurable frequency of Buildout run must be stored on the Computer in SlapOS master at registration time and updated, else it is impossible to check promise fulfillment.
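
The rule for the partition colour, including the two frequencies mentioned in the notes, can be sketched as follows. The function and its parameters are hypothetical and only restate the conditions listed above; how SlapOS Master actually stores reports and bangs is not covered here.

```python
from datetime import datetime, timedelta


def partition_colour(last_ok_report, last_bang, now, computer_period,
                     software_period=None):
    """Return "Green" or "Red" for a partition.

    last_ok_report: time of the latest OK result, or None if the latest
    result was a failure. last_bang: time of the latest bang, or None.
    """
    if last_ok_report is None:
        return "Red"
    # The report must be fresher than every configured buildout frequency.
    max_age = computer_period
    if software_period is not None:
        max_age = min(max_age, software_period)
    if now - last_ok_report >= max_age:
        return "Red"
    # A bang after the report invalidates it until buildout runs again.
    if last_bang is not None and last_bang > last_ok_report:
        return "Red"
    return "Green"


now = datetime(2021, 1, 11, 12, 0, 0)
one_day = timedelta(days=1)
report = now - timedelta(hours=2)

fresh = partition_colour(report, None, now, one_day)
banged = partition_colour(report, now - timedelta(hours=1), now, one_day)
stale = partition_colour(report, None, now, one_day,
                         software_period=timedelta(hours=1))
failed = partition_colour(None, None, now, one_day)
```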

Monitoring Crimes

  • Buildout runs all the time without ever going to sleep
  • Run all promises every minute
  • Always failing promises
  • Buildout taking too long to process a computer partition

There are four monitoring crimes that every developer should keep in mind:

  • Buildout runs all the time without ever going to sleep
    If Buildout runs all the time, too many resources are consumed, which can overload the server. One should take care that all promises of the Software Release can be fulfilled.
  • Run all promises every minute
    It is not required to run all promises in monitor every minute. Instead, the frequency should be configurable and set for each promise.
  • Always failing promises
    If a promise never reaches the stage where it passes, the SR is badly implemented and should be reviewed.
  • Buildout taking too long to process a computer partition
    Buildout should process a computer partition in a short time, else it prevents responsive provisioning of other partitions. The time to process a computer partition should be less than one minute.

Watchdog

This section introduces the "Watchdog", a process that is monitoring other processes and can call "bang" to the Master.

Watchdog Explained

Watchdog is a simple SlapOS Node feature which allows watching any process managed by supervisord. All process scripts in the PARTITION_DIRECTORY/etc/service directory are watched. They are automatically configured in supervisord with an added on-watch suffix on their process_name. Whenever one of them exits, watchdog triggers an alert (bang) that is sent to the Master. Bang forces SlapGrid to reprocess all instances of the service. This also forces a recheck of all promises and posts the result to the master, letting the master decide whether the partition state is Green or Red.

Bang

  • Called explicitly (eg. by a Promise or a Service)
  • Called implicitly when a process watched by Watchdog changes to an unexpected state

A partition should be able to call bang as often as needed in a day. There should be no hard limit on the number of calls, else it is not possible to adapt dynamically. A Master protection against recurring bang calls should nevertheless be considered, using a kind of quota per day that might depend on price or be defined in the software release. If the bang quota of the day is reached, the master will reject all further calls until the next day.
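
The daily quota idea can be illustrated with a small in-memory counter. This is a sketch of the proposal only, since the text does not describe an existing implementation: the class BangQuota, its method accept() and the way the day is passed in are all hypothetical.

```python
from collections import defaultdict
from datetime import date


class BangQuota:
    """Hypothetical per-partition daily quota for bang calls."""

    def __init__(self, max_per_day):
        self.max_per_day = max_per_day
        # (partition_id, day) -> number of accepted bangs
        self._counts = defaultdict(int)

    def accept(self, partition_id, day):
        """Return True and count the bang, or False once the quota is reached."""
        key = (partition_id, day)
        if self._counts[key] >= self.max_per_day:
            return False
        self._counts[key] += 1
        return True


quota = BangQuota(max_per_day=2)
today = date(2021, 1, 11)
tomorrow = date(2021, 1, 12)

same_day = [quota.accept("partition-1", today) for _ in range(3)]
next_day = quota.accept("partition-1", tomorrow)
```

With a quota of two, the third bang of the day is rejected and calls are accepted again on the next day.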

As a bang will trigger a run of Buildout, Buildout, in theory, is run all the time repeatedly. This is why it is supposed to have 0 execution time (theoretical model). But since that would take 100% of CPU, we have to call it less often. So, we find ways to call it less often:

  • every X (this can be configured at the profile level)
  • if promises are not all satisfied
  • if requested services are not available
  • as the result of bang

Buildout is actually called by SlapGrid. SlapGrid itself is called every Y (in theory, Y = 0, but in reality 1 minute). So, SlapGrid is called:

  • at least every minute
  • right after a SlapGrid call if something happened in the previous call (eg. request of new service, failing Promise) with an increasing delay to reduce CPU load
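
The "increasing delay" between follow-up runs can be modelled as a capped exponential backoff. The text does not say which schedule SlapGrid uses, so the doubling, the 1 second base and the function name are assumptions; only the one minute upper bound comes from the description above.

```python
def follow_up_delay(consecutive_busy_runs, base=1, maximum=60):
    """Seconds to wait before the next SlapGrid run.

    consecutive_busy_runs counts the runs in a row in which something
    happened (new service requested, failing promise, ...).
    """
    if consecutive_busy_runs == 0:
        # Nothing happened: wait for the regular one-minute run.
        return maximum
    # Double the delay on every busy run, capped at the regular interval.
    return min(base * 2 ** (consecutive_busy_runs - 1), maximum)


delays = [follow_up_delay(n) for n in range(8)]
```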

Currently bang has to go through the master. A shortcut that does not go through the master could be considered in the future, but it is probably simpler and cleaner to run SlapProxy locally if one needs full autonomy.

Thank You

  • Nexedi SA
  • 147 Rue du Ballon
  • 59110 La Madeleine
  • France