Disaster recovery
Sometimes technology goes wrong.
This document covers some scenarios and provides a list of actions we can take to resolve them. It also contains details about informing our users when there’s a problem.
There’s no strict plan. Every problem is different, and we’ll need a different combination of actions to resolve each one.
Things that could go wrong
GitHub goes down
Symptoms
- we can’t deploy, because our Docker images are built with GitHub Actions and the container registry we use is on GitHub
- so long as there’s nothing we need to publish imminently, we’re ok
Actions
- wait for GitHub to come back up
DfE Sign-in goes down
Symptoms
- School users can’t log in
- Appropriate body users can’t log in
- Lead provider API unaffected
- Admin interface unaffected
Actions
- display a notification banner on the site informing users that there are problems logging in
- if it’s a prolonged outage, hide the ‘Sign in’ button
- wait for DfE Sign-in to come back up
Contact
- DfE Sign-in will have plans in place to let users know the service is down, so sending extra comms ourselves might just confuse matters
GOV.UK Notify goes down
Symptoms
- Admins can’t log in
- Schools unaffected
- Appropriate bodies unaffected
- We can’t send out any mass comms
Actions
- Wait for GOV.UK Notify to come back up
- inform the support team that Notify’s down and they won’t be able to log into RECT’s admin area
- if this happens on an important day like registration opening, we might need developers to make the changes directly rather than using the admin UI
- we can help admins log in by retrieving their OTP from ActiveJob directly, but this is insecure and should only be used as a last resort
TRS API goes down
Symptoms
- no teachers can be registered by schools
- no teachers can be registered by appropriate bodies
- Lead provider API unaffected
- Admin interface unaffected
- teacher syncing will stall, jobs will queue up
Actions
- we probably want to temporarily close the app using maintenance mode because so much functionality in the app would break
Contact
- inform SITs and appropriate bodies they’ll be unable to register new teachers
- inform the support team that they might get an influx of tickets because the app is down or journeys they were mid-way through were interrupted
Azure goes down
Symptoms
- whole app is down
- databases are inaccessible
- we’re unable to retrieve a list of SITs from our database or from DfE Sign-in
Actions
- wait for Azure to come back up
Contact
- let the support team know there’s likely to be an influx of tickets
We accidentally lose some data
Symptoms
- entirely depends which table/tables are affected
- some parts of the app might not work as expected
- users might report some records are missing
Actions
- this qualifies as an incident, so start the incident process. The priority depends on:
  - the data that’s gone
  - how easy it is to replace
  - how many users it will affect
- stop the service and do a full restore if:
  - important data is missing
  - we notice it’s missing quickly
- stop the service and do a partial restore if:
  - important data is missing
  - we take longer to realise it’s missing and want to spend a bit more time carefully inserting the missing records while no new data is being added
- leave the service running and do a partial restore if:
  - less-important data is missing
  - the data can be added back without being affected by new data being added to the service
We accidentally delete the database
Symptoms
- service is entirely broken
Actions
- this is a P1, start the incident process
- enable maintenance mode
- do a full data restore
Contact
- schools
- appropriate bodies
- lead providers
Actions we can take
Full data restore
If we need to replace the database with an older version, we’ll need to take the service down for multiple hours and inform all of our users.
It should only be done if we suffer from a catastrophic data loss.
If we need to restore the entire database we’ll want to do it from a point in time recovery (PITR), choosing the latest time where we’re sure the data is fully intact.
- follow the PITR process, which will create a new PostgreSQL Server instance in Azure, and make a note of the timestamp we want to use
- create a commit on a branch that uncomments production from the ‘Restore database from point in time to new database server’ workflow
- using the new branch, run the `database-restore-ptr.yml` action against `production` and use the timestamp we recorded in step 1
- when it’s done, use the temporary maintenance URL to log in and check the data’s present
Partial data restore
If we accidentally lose some data and want to restore it, there’s a chance that more records have been written since the loss.
This means this is a data insertion task rather than restoring a backup, so enabling the maintenance mode is optional here depending on the tables we’re restoring.
- follow the PITR process which will create a new PostgreSQL Server instance in Azure
- we’ll probably have to make some manual adjustments, so copy what we need to a local backup so we can work with it:

```bash
bin/konduit.sh -n cpd-production -s s189p01-cpdec2-pd-pg-pitr -x cpd-ec2-production-web -- pg_dump -F t -E utf8 -f pitr-backup.sql.tar
```

- restore it to a local database:

```bash
createdb pitr-restore
# pg_dump -F t writes a tar-format archive, which pg_restore reads directly
pg_restore -d pitr-restore pitr-backup.sql.tar
```

- work out what we need to restore and how best to do it. It’ll probably involve selecting some rows for re-insertion. This will be tricky if we’re merging the restored records with newer ones, especially if they span multiple tables.
- use the data to build a PR or ad hoc script to re-insert the data
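One way to start the row selection is to export only the rows we think are missing to a CSV file, so they can be reviewed before anything is re-inserted. The helper below is a minimal sketch that prints a `psql` command rather than running it. The table name and IDs passed to it are hypothetical, and it assumes the table has an integer `id` primary key, which may not hold for every table.

```bash
# Print a psql command that exports selected rows from the local
# `pitr-restore` database to a CSV file for review.
# Assumptions: the table has an integer `id` column; table name and IDs
# are supplied by the caller and are purely illustrative.
export_rows_command() {
  local table="${1:-}"
  local ids="${2:-}"   # comma-separated integer IDs, e.g. 101,102

  # Only allow plain identifiers and digits so the values are safe to embed in SQL.
  case "$table" in
    ''|*[!a-z0-9_]*) echo "invalid table name: $table" >&2; return 1 ;;
  esac
  case "$ids" in
    ''|*[!0-9,]*) echo "ids must be comma-separated integers" >&2; return 1 ;;
  esac

  local query="SELECT * FROM $table WHERE id IN ($ids)"
  printf '%s\n' "psql pitr-restore -c \"\\copy ($query) TO '$table.csv' WITH (FORMAT csv, HEADER)\""
}
```

For example, `export_rows_command teachers 101,102` prints a command that would write the two rows to `teachers.csv`. Printing first gives us a chance to read the query before it touches any data.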
Maintenance mode
Enabling maintenance mode stops users from accessing the application, but leaves it running with an internal URL which will be printed by the Set maintenance mode GitHub Action.
If we want to customise the text on the page, edit the templates in the repo’s `maintenance_page` directory.
To enable maintenance mode:
- go to the Set maintenance mode action in the public repo on GitHub
- click ‘Run workflow’
- set the environment to ‘Production’ and leave the mode set to ‘enable’
- run the action
To disable it, follow the same steps as for enabling it but set the mode to ‘disable’.
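The same workflow can probably be triggered from a terminal with the GitHub CLI, which may be handy if the web UI is slow during an incident. The sketch below only prints the command so it can be checked before running. It assumes the workflow’s inputs are named `environment` and `mode` and that `production` is an accepted value. None of that is confirmed here, so check the workflow file first.

```bash
# Print (not run) a GitHub CLI command that triggers the maintenance workflow.
# Assumptions: the workflow inputs are called `environment` and `mode`,
# and `production` is an accepted environment value.
maintenance_mode_command() {
  local mode="${1:-}"
  case "$mode" in
    enable|disable) ;;
    *)
      echo "usage: maintenance_mode_command enable|disable" >&2
      return 1
      ;;
  esac
  printf 'gh workflow run "Set maintenance mode" -f environment=production -f mode=%s\n' "$mode"
}
```

Run `maintenance_mode_command enable` to see the command, then paste it into a shell inside a checkout of the repo once the input names have been verified.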
Display a notification banner
RECT has two types of banner, incident and maintenance. They both work in the same way and are collectively referred to as notification banners.
- create a new branch with an appropriate name
- review the content in the notification banners partial and change the text if necessary
- edit the environment config and change `ENABLE_INCIDENT_BANNER` or `ENABLE_MAINTENANCE_BANNER` to `true`
- commit your changes and create a pull request
- once your pull request is merged and deployed the notification banner will be enabled
To remove the notification banner, create a new PR with the banner set back to `false`.
Additionally, we can create a service banner on DfE Sign-in by logging into the manage service, selecting ‘Register early career teachers’, and clicking ‘Create service banner’.
This will be shown to any school or appropriate body users who log into RECT using DfE Sign-in.
Manual deployments
Any branch can be manually deployed using the manual deployment workflow.
We can also use the make commands to do it from the command line, providing a Docker image has been built and can be pulled from GitHub’s container registry.
```bash
DOCKER_IMAGE=ghcr.io/dfe-digital/abc123 make production terraform-apply
```
Contacting users
School and appropriate body users
We don’t hold a list of people with access to register early career teachers. If we need to contact our users, we’ll need to get a list of them from DfE Sign-in.
Once we have the list we can use GOV.UK Notify to send a bulk email.
Lead providers
Lead providers can be contacted using the (private) Teams channels or using the contact list.
Processes
Point-in-time-restoring a database
A point-in-time-restore creates a copy of the database as it was at the specified time. We have a one-week window. If data was lost more than a week ago, restore a database backup instead.
- find the database in Azure Portal
- click the ‘Restore’ tab at the top of the blade
- enter a name for the backup using the established convention with a suffix that makes it clear it’s a PITR restore, e.g., `s189t01-cpdec2-st-pg` could be called `s189t01-cpdec2-st-pg-pitr-20260522`. It won’t be around for long, but we want to be able to find it later!
- select the time the restore should be taken from. This should be the latest point we are sure the data was intact
- click ‘Review and create’, confirm, and wait for the restore to happen
- when it’s restored, connect to it with konduit and make sure the data is in the right state. If it’s not, go back to step 2 and repeat until it is
Once the restore is done, work out what data needs to be extracted and make a plan for inserting it back into the production database.
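To keep restored servers easy to find, the naming convention from the steps above can be captured in a small helper. This is a sketch of the `<source server>-pitr-<date>` pattern shown in the example name. The function name is made up for illustration, and the date defaults to today.

```bash
# Build a name for the restored server: <source server>-pitr-<YYYYMMDD>.
# The date defaults to today; pass one explicitly to use a different day.
pitr_server_name() {
  local source_server="${1:-}"
  local date_stamp="${2:-$(date +%Y%m%d)}"
  if [ -z "$source_server" ]; then
    echo "usage: pitr_server_name SOURCE_SERVER [YYYYMMDD]" >&2
    return 1
  fi
  printf '%s-pitr-%s\n' "$source_server" "$date_stamp"
}
```

For example, `pitr_server_name s189t01-cpdec2-st-pg 20260522` prints `s189t01-cpdec2-st-pg-pitr-20260522`, which can be pasted into the name field in Azure Portal.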
Connecting to a database with Konduit
Konduit is a tool created by DfE that allows us to tunnel connections to Azure.
It can be installed by running `make install-konduit`.
Use it like this:
```bash
bin/konduit.sh -n namespace -s target-database -x deployment-to-connect-via -- command-to-run
```

- namespace: `cpd-production`
- target-database: e.g., `s189p01-cpdec2-pd-pg-pitr`
- deployment-to-connect-via: e.g., `cpd-ec2-production-web`
- command-to-run: e.g., `psql`, `pg_dump` or `pg_restore`
So, the full command could look like:
```bash
bin/konduit.sh -n cpd-production -s s189p01-cpdec2-pd-pg-pitr -x cpd-ec2-production-web -- psql
```
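When checking a restored server, a quick row count per table is often enough to tell whether the data is in the right state. The helper below is a sketch that prints a konduit command for counting rows in one table. It assumes konduit passes everything after `--` to the command unchanged (the `pg_dump` usage elsewhere in this document suggests it does), and the table name passed in is hypothetical.

```bash
# Print a konduit command that counts the rows in one table of a restored server.
# Assumption: konduit forwards all arguments after `--` to psql unchanged.
konduit_count_command() {
  local server="${1:-}"
  local table="${2:-}"
  if [ -z "$server" ]; then
    echo "usage: konduit_count_command SERVER TABLE" >&2
    return 1
  fi
  # Only allow plain identifiers so the table name is safe to embed in SQL.
  case "$table" in
    ''|*[!a-z0-9_]*) echo "invalid table name: $table" >&2; return 1 ;;
  esac
  printf 'bin/konduit.sh -n cpd-production -s %s -x cpd-ec2-production-web -- psql -c "SELECT count(*) FROM %s"\n' "$server" "$table"
}
```

For example, `konduit_count_command s189p01-cpdec2-pd-pg-pitr teachers` prints the full command for a table called `teachers`. Comparing the count against what we expect gives a first signal before any closer inspection.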
Start the incident process
Get all the people you think you need on a group call as quickly as you can, and follow the steps in the Schools Digital incident playbook.