Self-healing infrastructure is something that has always piqued my interest. The first iteration of self-healing infrastructure that I came across was the Solaris Service Management Facility, aka “SMF”. SMF would restart services if they crashed due to hardware errors or general errors outside of the service itself.
For today's article we are going to explore another way of creating a self-healing environment, going beyond restarting failed services. We are going to take a snippet of code that connects to a database service and give that application not only the ability to reconnect during a database failure but also the ability to automatically resolve the database issues.
Starting with a simple connection
For today's article we are going to take a snippet of code from an existing application and give it self-healing super powers. The code we are using is from Runbook, a side project of mine that does all sorts of cool automation for DevOps.
# RethinkDB Server
try:
    rdb_server = r.connect(
        host=config['rethink_host'], port=config['rethink_port'],
        auth_key=config['rethink_authkey'], db=config['rethink_db'])
    print("Connected to RethinkDB")
except (RqlDriverError, socket.error) as e:
    print("Cannot connect to rethinkdb, shutting down")
    # Format inside the call so this works on both Python 2 and Python 3
    print("RethinkDB Error: %s" % e)
    sys.exit(1)
This code has been altered a bit for simplification.
The code above will attempt to connect to a RethinkDB instance. If successful it creates a connection object, rdb_server, which can be used later for running queries against the database. If the connection is not successful the application will log an error and exit with an exit code of 1.
To put it simply, if RethinkDB is down or not accepting connections this process stops.
Let's try again
Before we start adding super powers we need to change how the application handles connection errors. Right now it simply exits the process and unless we have external systems restarting the process it never attempts to reconnect. For a self healing application we should change this behavior to have the application reattempt connections until RethinkDB is online.
# Set initial values
connection_attempts = 0
connected = False
# Retry RethinkDB Connections until successful
while not connected:
    # RethinkDB Server
    try:
        rdb_server = r.connect(
            host=config['rethink_host'], port=config['rethink_port'],
            auth_key=config['rethink_authkey'], db=config['rethink_db'])
        connected = True
        print("Connected to RethinkDB")
    except (RqlDriverError, socket.error) as e:
        print("Cannot connect to rethinkdb")
        # Format inside the call so this works on both Python 2 and Python 3
        print("RethinkDB Error: %s" % e)
        connection_attempts = connection_attempts + 1
        print("RethinkDB connection attempts: %d" % connection_attempts)
If we break down the above code we can see that we added two new variables and a while loop. The code will simply retry connecting to RethinkDB until successful. In some ways this in itself makes the application self-healing, as it gracefully handles an error with an external system and keeps trying to reconnect. These, however, are not the super powers I was referring to.
Giving our application superpowers via Saltstack
In an earlier article I covered implementing salt-api, the API for Saltstack. While that article covered utilizing salt-api with third party services such as Runbook or Datadog, that same level of integration could be added to applications themselves, giving those applications the ability to run infrastructure tasks.
Using Salt-API and Reactor Formula
For the sake of brevity this article will assume that you already have Saltstack and salt-api installed and configured to accept webhook requests as outlined in the previous article. For this article we will also be utilizing a salt-api and reactor formula that I created for Runbook.
This formula provides several template reactor configurations that can be used to pick up salt-api webhook requests and perform salt actions, such as restarting services, executing shell commands, or even starting a highstate. To get started we will first need to download and extract the formula.
# wget -O /var/tmp/master.zip https://github.com/madflojo/salt-api-reactor-formula/archive/master.zip
# cd /var/tmp/
# unzip master.zip
Once extracted we can copy the reactor directory to /srv/salt/. This is the default salt directory and may need to be updated for your environment.
# cp -R salt-api-reactor-formula-master/reactor /srv/salt/
We will also need to deploy our reactor config to the /etc/salt/master.d/ directory, as this is what maps the URL endpoint to a specific salt action. Once deployed we will also need to restart the salt-master service.
# cp salt-api-reactor-formula-master/reactor.conf /etc/salt/master.d/
# service salt-master restart
Examining a reactor configuration
When our application is unable to connect to RethinkDB we want to perform some sort of corrective task. The easiest and safest thing to do in Runbook's environment is to simply run a salt highstate. A highstate execution will tell Saltstack to go through all of the defined configurations and make them true on the desired minion server. In our environment that includes ensuring the RethinkDB service is running and configured.
If our application is able to call a highstate execution on the database hosts there is a good chance that the issue may be corrected. This gives our application the ability to resolve any issue that was caused by RethinkDB not matching our desired state.
highstate.sls
In order to give our application the ability to run a highstate we will utilize the reactor/states/highstate.sls formula. Before going further we should first examine how this formula works.
{% set postdata = data.get('post', {}) %}
{% if postdata.secretkey == "PICKSOMETHINGBETTERPLZKTHX" %}
state_highstate:
  cmd.state.highstate:
    - tgt: '{{ postdata.tgt }}'
    {% if "matcher" in postdata %}
    - expr_form: {{ postdata.matcher }}
    {% endif %}
    {% if "args" in postdata %}
    - arg:
      - {{ postdata.args }}
    {% endif %}
{% endif %}
When a POST request is made to the http://saltapiurl/webhooks/states/highstate address, salt-api will take the POST data of that request and pass it along Salt's event system. When processed, this reactor configuration will take the POST data and assign it to a dictionary named postdata. From there salt will check for a key in the postdata dictionary named secretkey and ensure that the value of that key matches the “secretkey” defined in the template. This check acts as an authentication method for webhooks.
Each reactor template has an example secret key defined; it is recommended that you change this to a unique value for your environment.
After validation salt will look for additional keys in the postdata dictionary; for our purposes we need to understand the tgt and matcher keys. The tgt key is used to specify the “target” for the highstate execution. This target can be a hostname, a grain value, a pillar value, a subnet or any other target Saltstack accepts. The matcher key defines the type of expression in the tgt key. For instance, if the tgt value was a hostname, the matcher value should be glob for a hostname glob. If the tgt value was a pillar value, the matcher value should be pillar. You can find all of the valid matcher values in salt-api's documentation.
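To make the pairing of tgt and matcher concrete, here is a minimal sketch of how the POST fields might be assembled in Python. The build_postdata helper, the example-secret value and the role:database pillar target are all hypothetical and only for illustration; the field names are the ones the reactor template reads.

```python
def build_postdata(tgt, matcher, secretkey, args=None):
    """Assemble the POST fields that the reactor template reads from postdata."""
    postdata = {"tgt": tgt, "matcher": matcher, "secretkey": secretkey}
    if args is not None:
        # "args" is optional; the template only uses it when present
        postdata["args"] = args
    return postdata


# Target by hostname glob (matcher "glob")
by_hostname = build_postdata("db*", "glob", "example-secret")

# Target by pillar value (matcher "pillar"); the pillar key is made up
by_pillar = build_postdata("role:database", "pillar", "example-secret")

print(by_hostname)
print(by_pillar)
```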
Calling salt-api
Now that we have salt-api configured to accept webhook requests and start highstate executions, we now need to code our application to call those webhooks. Since this is something we may want to do somewhat often in our code we can create a function to perform this webhook request.
Highstate Function
def callSaltHighstate(config):
    ''' Call Saltstack to initiate a highstate '''
    import requests
    url = config['salt_url'] + "/states/highstate"
    headers = {
        "Accept" : "application/json"
    }
    postdata = {
        "tgt" : "db*",
        "matcher" : "glob",
        "secretkey" : config['salt_key']
    }
    try:
        # verify=False skips TLS certificate verification
        req = requests.post(url=url, headers=headers, data=postdata, verify=False)
        print("Called for help and got response code: %d" % req.status_code)
        if req.status_code == 200:
            return True
        else:
            return False
    except (requests.exceptions.RequestException) as e:
        print("Error calling for help: %s" % e)
        return False
The code above is pretty simple: it performs an HTTP POST request with the POST data fields tgt, matcher and secretkey. The tgt field contains db*, which in our environment is a hostname glob that matches our database server names. The matcher value is glob to denote that the tgt value is a hostname glob. The secretkey field contains the value of config['salt_key'], which is pulled from our configuration file when the main process starts and is passed to the callSaltHighstate() function.
Now that the code to call salt-api is defined we can add the callSaltHighstate() function into the exception handling for RethinkDB.
Adding callSaltHighstate as an action
# Set initial values
connection_attempts = 0
connected = False
# Retry RethinkDB Connections until successful
while not connected:
    # RethinkDB Server
    try:
        rdb_server = r.connect(
            host=config['rethink_host'], port=config['rethink_port'],
            auth_key=config['rethink_authkey'], db=config['rethink_db'])
        connected = True
        print("Connected to RethinkDB")
    except (RqlDriverError, socket.error) as e:
        print("Cannot connect to rethinkdb")
        print("RethinkDB Error: %s" % e)
        # Ask Saltstack to run a highstate on the database hosts
        callSaltHighstate(config)
        connection_attempts = connection_attempts + 1
        print("RethinkDB connection attempts: %d" % connection_attempts)
As you can see the code above hasn't changed much from the previous example. The biggest change is that after printing the RethinkDB error we experienced we then execute the callSaltHighstate() function.
Leveling up
For a simple example the above code works quite well, however there is a bit of a flaw. With the above code a highstate will be called every time the application attempts to connect to RethinkDB and fails. Since a highstate will take a bit of time to execute this could cause a backlog of highstate executions which could in theory cause even more issues.
To combat this you could add a time.sleep(120) at the end of the while loop, causing the application to sleep for 120 seconds between iterations. This would give Saltstack some time to execute the highstate before another is queued. While a sleep would work and is simple, it is not the most elegant method.
Since we can call Saltstack to perform essentially any task Saltstack can perform, why stop at just a highstate? Below we are going to create another function that calls salt-api, but rather than run a highstate this function will send a webhook request that tells salt-api to restart the RethinkDB service.
Restart function
def callSaltRestart(config):
    ''' Call Saltstack to restart a service '''
    import requests
    url = config['salt_url'] + "/services/restart"
    headers = {
        "Accept" : "application/json"
    }
    postdata = {
        "tgt" : "db*",
        "matcher" : "glob",
        "args" : "rethinkdb",
        "secretkey" : config['salt_key']
    }
    try:
        # verify=False skips TLS certificate verification
        req = requests.post(url=url, headers=headers, data=postdata, verify=False)
        print("Called for help and got response code: %d" % req.status_code)
        if req.status_code == 200:
            return True
        else:
            return False
    except (requests.exceptions.RequestException) as e:
        print("Error calling for help: %s" % e)
        return False
The above code is very similar to the highstate function, with the exception that the URL endpoint has changed to /services/restart (which utilizes the reactor/services/restart.sls template) and there is a new POST data key called args, which contains rethinkdb, the service we want to restart.
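Because the two functions differ only in their endpoint and one extra POST field, they could be folded into a single helper. The sketch below is one possible shape, not the code Runbook uses: call_salt_webhook, its post parameter, the FakeResponse class and the salt.example.com URL are all assumptions made so the example can run without a salt-api server.

```python
def call_salt_webhook(config, endpoint, extra=None, post=None):
    """POST to a salt-api webhook endpoint; return True on HTTP 200."""
    if post is None:
        import requests  # imported lazily so a stand-in can be injected
        post = requests.post
    postdata = {"tgt": "db*", "matcher": "glob", "secretkey": config["salt_key"]}
    postdata.update(extra or {})
    try:
        # verify=False mirrors the article's functions (no TLS verification)
        resp = post(url=config["salt_url"] + endpoint,
                    headers={"Accept": "application/json"},
                    data=postdata, verify=False)
    except Exception as e:  # broad on purpose: the poster is pluggable
        print("Error calling for help: %s" % e)
        return False
    print("Called for help and got response code: %d" % resp.status_code)
    return resp.status_code == 200


# Hypothetical stand-ins so the sketch runs without a real salt-api
class FakeResponse:
    status_code = 200


sent = {}


def fake_post(**kwargs):
    sent.update(kwargs)
    return FakeResponse()


config = {"salt_url": "https://salt.example.com/webhooks",
          "salt_key": "example-secret"}
ok = call_salt_webhook(config, "/services/restart",
                       {"args": "rethinkdb"}, post=fake_post)
```

Under these assumptions, a highstate would be call_salt_webhook(config, "/states/highstate") and a restart would pass {"args": "rethinkdb"} as shown.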
Since we are adding the complexity of restarting the RethinkDB service we want to make sure that this call is not made too often. At the moment the best way to do this is to build that logic into the application itself.
Extending when to call salt-api
# Set initial values
connection_attempts = 0
first_connect = 0.00
last_restart = 0.00
last_highstate = 0.00
connected = False
called = None
# Retry RethinkDB Connections until successful
while not connected:
    if first_connect == 0.00:
        first_connect = time.time()
    # RethinkDB Server
    try:
        rdb_server = r.connect(
            host=config['rethink_host'], port=config['rethink_port'],
            auth_key=config['rethink_authkey'], db=config['rethink_db'])
        connected = True
        print("Connected to RethinkDB")
    except (RqlDriverError, socket.error) as e:
        print("Cannot connect to rethinkdb")
        print("RethinkDB Error: %s" % e)
        # Seconds since the first connection attempt
        timediff = time.time() - first_connect
        if timediff > 300.00:
            # Restart: only after 10 minutes of failures, at most once per 10 minutes
            last_timediff = time.time() - last_restart
            if last_timediff > 600.00 or last_restart == 0.00:
                if timediff > 600:
                    callSaltRestart(config)
                    last_restart = time.time()
            # Highstate: after 5 minutes of failures, at most once per 5 minutes
            last_timediff = time.time() - last_highstate
            if last_timediff > 300.00 or last_highstate == 0.00:
                callSaltHighstate(config)
                last_highstate = time.time()
        connection_attempts = connection_attempts + 1
        print("RethinkDB connection attempts: %d" % connection_attempts)
        # Wait a minute before the next connection attempt
        time.sleep(60)
As you can see, we added quite a bit of logic around when to run and when not to run. When our application is unable to connect to RethinkDB it will keep retrying until successful, just as before, now pausing 60 seconds between attempts. Once it has been unable to connect for more than 5 minutes, it will call Saltstack via salt-api requesting a highstate on the database servers, and will repeat that request at most once every 5 minutes. Once it has been unable to connect for more than 10 minutes, it will also request a restart of the RethinkDB service on all database servers, at most once every 10 minutes.
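The timing rules in the loop above can be easier to reason about when separated from the connection code. The following sketch restates them as a pure function; plan_actions is a hypothetical name, and it uses None for "never called" where the loop uses 0.00.

```python
def plan_actions(elapsed, since_restart, since_highstate):
    """Decide which salt-api calls to make after a failed connection attempt.

    elapsed:         seconds since the first connection attempt
    since_restart:   seconds since the last restart request, or None if never
    since_highstate: seconds since the last highstate request, or None if never
    """
    actions = []
    if elapsed > 300:
        # Restart only after 10 minutes of failures, at most once per 10 minutes
        if elapsed > 600 and (since_restart is None or since_restart > 600):
            actions.append("restart")
        # Highstate after 5 minutes of failures, at most once per 5 minutes
        if since_highstate is None or since_highstate > 300:
            actions.append("highstate")
    return actions


print(plan_actions(400, None, None))   # 5+ minutes in, nothing requested yet
print(plan_actions(700, None, 200))    # 10+ minutes in, recent highstate
```

The first call returns ["highstate"] and the second returns ["restart"], since the highstate requested 200 seconds earlier is still within its 5 minute window.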
Improvements
With today's example we are able to correct situations that many applications cannot. Being able to restart the database when you are unable to connect to it is a good example of a self healing environment. However, there are more things that could be done to this application.
This same type of logic could be built into Query exceptions rather than Connection exceptions only. With query exceptions you could also use salt-api to execute database maintenance scripts or call salt-cloud to provision additional servers. Once you give your application the ability to perform infrastructure wide actions you open the door to a wide range of automation capabilities.
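As a rough illustration of how a healing call might be attached to query code, the sketch below uses a decorator that runs a healing hook when a query raises and then re-raises the error for the caller to handle. Everything here is hypothetical: heal_on and run_query are not part of Runbook, RuntimeError stands in for a RethinkDB query exception, and in a real application the hook would be something like a call to callSaltHighstate(config).

```python
import functools


def heal_on(exceptions, heal):
    """Decorator: when the wrapped call raises, run heal() and re-raise."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except exceptions:
                heal()  # request corrective action
                raise   # let the caller decide whether to retry
        return wrapper
    return decorator


healed = []


@heal_on((RuntimeError,), heal=lambda: healed.append("highstate requested"))
def run_query():
    # Stand-in for a failing database query
    raise RuntimeError("query failed")


try:
    run_query()
except RuntimeError:
    print(healed)
```

The decorator does not retry on its own, so any retry would need the same kind of spacing between attempts as the connection loop.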
To see the full script from this example you can view it on this GitHub Gist