Responding to alerts and incidents
This section is a quick reference version of the full incident management document.
Types of alert and where you’ll see them
Alerts come from different sources.
Sentry alerts
Emails are sent to the GovWifi developers mailbox. Notifications are sent to the #govwifi-monitoring Slack channel.
AWS CloudWatch alerts
Emails are sent to the GovWifi critical alerts mailbox. Notifications are sent to the #govwifi-monitoring Slack channel.
StatusPage alerts
Emails are sent to the GovWifi support mailbox. Notifications are sent to the #govwifi Slack channel.
PaaS alerts
Emails are sent to the GovWifi support mailbox.
Notify alerts
Emails are sent to the GovWifi support mailbox and StatusPage.
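The routing above can be kept as a small lookup table, which makes it easy to answer "which sources post to this channel?" or "which sources email this mailbox?". The following is a minimal sketch in Python, not part of any GovWifi tooling: the names `AlertRoute`, `ALERT_ROUTES`, `sources_for_channel` and `sources_for_recipient` are illustrative assumptions, and only the sources, mailboxes and channels come from the list above.

```python
from typing import List, NamedTuple, Optional, Tuple


class AlertRoute(NamedTuple):
    """Where alerts from one source end up (hypothetical structure)."""
    email_recipients: Tuple[str, ...]
    slack_channel: Optional[str]  # None where the list above names no channel


# Values are transcribed from the list above.
ALERT_ROUTES = {
    "Sentry": AlertRoute(("GovWifi developers mailbox",), "#govwifi-monitoring"),
    "AWS CloudWatch": AlertRoute(("GovWifi critical alerts mailbox",), "#govwifi-monitoring"),
    "StatusPage": AlertRoute(("GovWifi support mailbox",), "#govwifi"),
    "PaaS": AlertRoute(("GovWifi support mailbox",), None),
    "Notify": AlertRoute(("GovWifi support mailbox", "StatusPage"), None),
}


def sources_for_channel(channel: str) -> List[str]:
    """Return the alert sources that notify the given Slack channel."""
    return sorted(
        source for source, route in ALERT_ROUTES.items()
        if route.slack_channel == channel
    )


def sources_for_recipient(recipient: str) -> List[str]:
    """Return the alert sources that send email to the given recipient."""
    return sorted(
        source for source, route in ALERT_ROUTES.items()
        if recipient in route.email_recipients
    )


if __name__ == "__main__":
    print(sources_for_channel("#govwifi-monitoring"))
    print(sources_for_recipient("GovWifi support mailbox"))
```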
Tell people if you see an alert
- Share a brief summary of what you’ve seen on the #govwifi Slack channel.
- Find out if anyone else is already investigating the issue.
- If the issue is security related, tell the Cyber Security Team. You can use the #cyber-security-help Slack channel.
Appoint an incident lead
Make sure the product manager, delivery manager or tech lead is aware of the incident.
One of them will be the ‘incident lead’.
Categorise the incident
The tech lead should lead the categorisation of the incident and discuss it with the relevant people.
P1 incidents (critical)
These are situations where there’s:
- a complete outage
- unauthorised access
The main criterion is that registered users cannot authenticate on RADIUS services and access the internet using GovWifi.
Possible examples are:
- A serious outage with the production platform, for example:
  - Failure of one or both AWS regions leading to users being unable to authenticate
  - Issue with AWS Elastic Load Balancer leading to increased traffic and authentication failure for users
- Loss of ownership of RADIUS Elastic IPs
  - This would mean we’d have to ask organisations to re-configure their infrastructure to use new IP addresses, causing major remedial work and reputational damage
We must:
- respond within 30 minutes during business hours
- give an update every hour
P2 incidents (major)
These are situations where there’s a substantial degradation of the service.
The main criteria are:
- new organisations cannot register to use GovWifi
- new end users cannot sign up to use GovWifi
- existing admin users cannot access GovWifi admin
A possible example is a significant issue with the production platform on one of the AWS regions. For example, if the London region failed, GovWifi admin would not work and users would not be able to sign up to GovWifi. However, authentication for existing users would be fine.
We must:
- respond within 1 hour during business hours
- give an update every 2 hours
P3 incidents (significant)
These are situations where there’s intermittent or degraded service due to a platform issue.
The main criteria are:
- users experience intermittent or degraded service
- the website is intermittently unavailable, or there are assets missing
A possible example is a temporary outage of RADIUS authentication.
We must:
- respond within 2 hours during business hours
- give updates every business day
P4 incidents (minor)
These are situations where there’s a component failure that is not immediately affecting the service.
A possible example is a failure of one of the RADIUS servers (there are 3 in each region).
We must:
- respond within 1 day during business hours
- give updates every 2 business days
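The response and update targets for the four priorities can be summarised in one table. The sketch below is illustrative only: `Target`, `TARGETS` and `respond_by` are hypothetical names, and the deadline calculation uses plain clock arithmetic. It assumes the incident is detected inside business hours and that the response window does not cross a closing time, weekend or holiday, because this document does not define business hours.

```python
from datetime import datetime, timedelta
from typing import NamedTuple


class Target(NamedTuple):
    """Targets for one priority level (hypothetical structure)."""
    label: str
    respond_within: timedelta
    update_cadence: str


# Figures are taken from the "We must" lists above.
TARGETS = {
    "P1": Target("critical", timedelta(minutes=30), "every hour"),
    "P2": Target("major", timedelta(hours=1), "every 2 hours"),
    "P3": Target("significant", timedelta(hours=2), "every business day"),
    # "1 day" is modelled as 24 hours here, which is a simplification.
    "P4": Target("minor", timedelta(days=1), "every 2 business days"),
}


def respond_by(priority: str, detected_at: datetime) -> datetime:
    """Latest first-response time, ignoring business-hours boundaries."""
    return detected_at + TARGETS[priority].respond_within


if __name__ == "__main__":
    detected = datetime(2024, 1, 8, 10, 0)  # made-up example time
    for priority, target in TARGETS.items():
        print(priority, target.label, respond_by(priority, detected),
              "updates", target.update_cadence)
```

A real calculation would need a business-hours calendar. The table is still useful on its own as a quick reminder of the targets.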
Update the Status Page
Add an entry briefly describing the service or services affected and the level of impact.
Work to fix the issue
The tech lead should lead the actions to fix the service. All developers - frontend and infrastructure - will help.
Tell the relevant people
The incident lead should make sure that the relevant people know about the incident.
The type of incident will affect who the ‘relevant people’ are and how often they should be updated.
P1 incidents
The incident lead needs to:
- send regular updates to the organisations that offer GovWifi in their buildings - use the latest copy of organisation email addresses or download the list from GovWifi admin
- open an incident on the GovWifi status page and add regular updates. You can do this by logging in with your work email account: choose “Login With Google”
- send quick reports of any significant events to the portfolio-incidents email address
- if there’s been a data breach, tell the Data Protection Officer and Cabinet Office, following the process documented in the service team GDPR documentation
We do not contact individual GovWifi users. This is because we would need the technical team to extract user information from the database. This takes time, which would be better spent fixing the service.
P2 incidents
The incident lead needs to:
- send quick reports of any significant events to the portfolio-incidents email address
- open an incident on the GovWifi status page and add regular updates
- if there’s been a data breach, tell the Data Protection Officer and Cabinet Office, following the process documented in the service team GDPR documentation
- talk to the team and decide what other communications are necessary
P3 and P4 incidents
The incident lead should talk to the team and decide what communications are necessary.
The team should decide if an incident should be opened on the GovWifi status page.
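The communication steps for each priority can be expressed as a simple checklist lookup. This is a sketch under stated assumptions rather than an agreed process: the constant and function names are hypothetical, the step wording is paraphrased from the lists above, and the data-breach step is only added for P1 and P2 because those are the only priorities where this document lists it.

```python
# Step wording is paraphrased from the lists above; names are illustrative.
ORGANISATIONS = "send regular updates to organisations that offer GovWifi"
STATUS_PAGE = "open an incident on the GovWifi status page and add regular updates"
PORTFOLIO = "send quick reports of significant events to portfolio-incidents"
DATA_BREACH = "tell the Data Protection Officer and Cabinet Office"
TEAM_DECIDES = "talk to the team and decide what communications are necessary"
STATUS_PAGE_OPTIONAL = "decide with the team whether to open a status page incident"


def communication_steps(priority: str, data_breach: bool = False) -> list:
    """Return the incident lead's communication checklist for a priority."""
    if priority == "P1":
        steps = [ORGANISATIONS, STATUS_PAGE, PORTFOLIO]
    elif priority == "P2":
        steps = [PORTFOLIO, STATUS_PAGE]
    elif priority in ("P3", "P4"):
        # The document lists no data-breach step for P3 and P4,
        # so the data_breach flag is not used here.
        return [TEAM_DECIDES, STATUS_PAGE_OPTIONAL]
    else:
        raise ValueError(f"unknown priority: {priority!r}")

    if data_breach:
        steps.append(DATA_BREACH)
    if priority == "P2":
        steps.append(TEAM_DECIDES)
    return steps


if __name__ == "__main__":
    for step in communication_steps("P2", data_breach=True):
        print("-", step)
```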
Create an incident report
The incident lead needs to create an incident report using the report template.
Throughout the incident, record actions taken, decisions and significant external communications in the report. Include timestamps.
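One way to keep timestamped notes during an incident is a small timeline helper whose output can be pasted into the report. This is a minimal sketch and does not reflect the actual report template: the `IncidentTimeline` class, its entry kinds and its output format are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional, Tuple

# Entry kinds mirror the three things the report should record.
KINDS = ("action", "decision", "communication")


@dataclass
class IncidentTimeline:
    """Collects timestamped notes during an incident (hypothetical helper)."""
    entries: List[Tuple[datetime, str, str]] = field(default_factory=list)

    def record(self, kind: str, text: str, at: Optional[datetime] = None) -> None:
        """Add a note; defaults to the current time in UTC."""
        if kind not in KINDS:
            raise ValueError(f"kind must be one of {KINDS}")
        self.entries.append((at or datetime.now(timezone.utc), kind, text))

    def render(self) -> str:
        """Return the notes in time order, one per line."""
        ordered = sorted(self.entries, key=lambda entry: entry[0])
        return "\n".join(
            f"{at.isoformat(timespec='minutes')} [{kind}] {text}"
            for at, kind, text in ordered
        )


if __name__ == "__main__":
    timeline = IncidentTimeline()
    # Made-up example entries.
    timeline.record("action", "Example action taken")
    timeline.record("decision", "Example decision made")
    print(timeline.render())
```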
After the incident
When the incident is resolved, the incident lead needs to record the completed template with a unique ID in the Incident Log.
Helping other organisation incident management teams with their incidents
From time to time, the GovWifi team may be asked to assist an organisation that is in incident management mode. The following principles should help make the engagement easier.
Identify the admin user for the organisation
In GovWifi admin, locate the admin users listed for the organisation in question. The incident lead for the organisation is unlikely to be an admin user, so invite at least one admin user to any resolution call.
Introduce everyone involved in the incident
When speaking to the organisation’s incident team, make sure everyone has been properly introduced. The team may have a very different structure to ours and we need to understand their roles and hierarchy.
Align the goals of any incident calls/meetings
By default, GovWifi should focus on helping to resolve the issue as quickly as possible for our end users. Organisations may prefer to establish the root cause of the incident first. We must work with the incident team to ensure our calls and meetings are outcome-focused.
Demonstrate empathy with the incident team
Emotions are likely to be running high for the incident team. This can lead to miscommunication. Keep calm and try to empathise with the incident team.
Consider the interoperability of unified communications tools
GovWifi uses Google Meet, but other organisations may not. If a unified communications tool does not work for either team, try to find one that works for everyone. Screen shares can be incredibly helpful to both teams, so finding a common platform is important.
Be confident in your knowledge of GovWifi
Always remember you are the GovWifi subject matter expert and take confidence in that knowledge.