Respond to alerts and incidents

This section is a quick reference version of the full incident management document.

Types of alert and where you’ll see them

Alerts come from different sources.

Sentry alerts

Emails are sent to the GovWifi developers mailbox. Notifications are sent to the #govwifi-monitoring Slack channel.

AWS CloudWatch alerts

Emails are sent to the GovWifi critical alerts mailbox. Notifications are sent to the #govwifi-monitoring Slack channel.

StatusPage alerts

Emails are sent to the GovWifi support mailbox. Notifications are sent to the #govwifi Slack channel.

PaaS alerts

Emails are sent to the GovWifi support mailbox.

Notify alerts

Emails are sent to the GovWifi support mailbox and StatusPage.
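
The routing described above can be summarised as a lookup table. The following is a minimal Python sketch for illustration only. The names `ALERT_ROUTES` and `where_to_look` are hypothetical and are not part of any GovWifi tooling. The mailbox and channel names are copied from the descriptions above.

```python
# Illustrative sketch only. ALERT_ROUTES and where_to_look are assumed names,
# not real GovWifi tooling. Destinations are taken from the text above.
ALERT_ROUTES = {
    "Sentry": {
        "email": ["GovWifi developers mailbox"],
        "slack": ["#govwifi-monitoring"],
    },
    "AWS CloudWatch": {
        "email": ["GovWifi critical alerts mailbox"],
        "slack": ["#govwifi-monitoring"],
    },
    "StatusPage": {
        "email": ["GovWifi support mailbox"],
        "slack": ["#govwifi"],
    },
    "PaaS": {
        "email": ["GovWifi support mailbox"],
        "slack": [],  # no Slack notification is listed for PaaS alerts
    },
    "Notify": {
        "email": ["GovWifi support mailbox", "StatusPage"],
        "slack": [],  # no Slack notification is listed for Notify alerts
    },
}


def where_to_look(source):
    """Return every destination an alert from `source` is delivered to."""
    route = ALERT_ROUTES[source]
    return route["email"] + route["slack"]


if __name__ == "__main__":
    for source in ALERT_ROUTES:
        print(source, "->", ", ".join(where_to_look(source)))
```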

Tell people if you see an alert

  1. Share a brief summary of what you’ve seen on the #govwifi Slack channel.
  2. Find out if anyone else is already investigating the issue.
  3. If the issue is security related, tell the Cyber Security Team. You can use the #cyber-security-help Slack channel or the GDS Rotas app.

Appoint an incident lead

Make sure the product manager, delivery manager or tech lead is aware of the incident.

One of them will be the ‘incident lead’.

Categorise the incident

The tech lead should lead the categorisation of the incident and discuss it with the relevant people.

P1 incidents (critical)

These are situations where there’s:

  • a complete outage
  • unauthorised access

The main criterion is that registered users cannot authenticate on RADIUS services and access the internet using GovWifi.

Possible examples are:

  • a serious outage with the production platform - for example, a failure of one or both AWS regions, leading to a loss of functionality and authentication for users
  • an issue with AWS Elastic Load Balancer leading to increased traffic and authentication failure for users
  • loss of ownership of RADIUS Elastic IP, meaning we’d have to ask organisations to re-configure their infrastructure to use new IP addresses, causing major remedial work and reputational damage

We must:

  • respond within 30 minutes during business hours
  • give an update every hour

P2 incidents (major)

These are situations where there’s a substantial degradation of the service.

The main criteria are:

  • new organisations cannot register to use GovWifi
  • new end users cannot sign up to use GovWifi
  • existing admin users cannot access GovWifi admin

A possible example is a significant issue with the production platform on one of the AWS regions. For example, if the London region failed, GovWifi admin would not work and users would not be able to sign up to GovWifi. However, authentication for existing users would be fine.

We must:

  • respond within 1 hour during business hours
  • give an update every 2 hours

P3 incidents (significant)

These are situations where there’s intermittent or degraded service due to a platform issue.

The main criteria are:

  • users experience intermittent or degraded service
  • the website is intermittently unavailable, or there are assets missing

A possible example is a temporary outage of RADIUS authentication.

We must:

  • respond within 2 hours during business hours
  • give updates every business day

P4 incidents (minor)

These are situations where there’s a component failure that is not immediately affecting the service.

A possible example is a failure of one of the RADIUS servers (there are 3 in each region).

We must:

  • respond within 1 day during business hours
  • give updates every 2 business days
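
The four priority levels and their targets can be compared side by side as data. The sketch below is a hedged illustration, not an official tool. `Priority`, `PRIORITIES` and `suggest_priority` are hypothetical names, and the boolean flags are a simplification of the criteria above. The tech lead still leads the categorisation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Priority:
    label: str
    respond_within: str  # target applies during business hours
    update_every: str


# Targets copied from the "We must" lists above.
PRIORITIES = {
    "P1": Priority("critical", "30 minutes", "hour"),
    "P2": Priority("major", "1 hour", "2 hours"),
    "P3": Priority("significant", "2 hours", "business day"),
    "P4": Priority("minor", "1 day", "2 business days"),
}


def suggest_priority(*, cannot_authenticate=False, unauthorised_access=False,
                     signup_or_admin_unavailable=False,
                     intermittent_or_degraded=False):
    """Return a first guess at the priority (assumed helper, not a decision).

    The flags are checked from most to least severe, so the highest
    matching priority wins.
    """
    if cannot_authenticate or unauthorised_access:
        return "P1"
    if signup_or_admin_unavailable:
        return "P2"
    if intermittent_or_degraded:
        return "P3"
    # Assumption: anything else reported as an incident is a component
    # failure that is not immediately affecting the service.
    return "P4"


if __name__ == "__main__":
    code = suggest_priority(signup_or_admin_unavailable=True)
    target = PRIORITIES[code]
    print(f"{code} ({target.label}): respond within {target.respond_within}, "
          f"update every {target.update_every}")
```

Keeping the targets as the original wording, rather than converting them to durations, avoids guessing how "1 day during business hours" should be counted.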

Work to fix the issue

The tech lead should lead the actions to fix the service. All developers - frontend and infrastructure - will help.

Tell the relevant people

The incident lead should make sure that the relevant people know about the incident.

The type of incident will affect who the ‘relevant people’ are and how often they should be updated.

P1 incidents

The incident lead needs to:

  • send regular updates to the organisations that offer GovWifi in their buildings - use this list of organisation email addresses or download the list from GovWifi admin
  • open an incident on the GovWifi status page and add regular updates
  • send quick reports of any significant events to the portfolio-incidents email address
  • if there’s been a data breach, tell the Data Protection Officer and Cabinet Office, following the process documented in the service team GDPR documentation

We do not contact individual GovWifi users. This is because we would need the technical team to extract user information from the database. This takes time, which would be better spent fixing the service.

P2 incidents

The incident lead needs to:

  • send quick reports of any significant events to the portfolio-incidents email address
  • if there’s been a data breach, tell the Data Protection Officer and Cabinet Office, following the process documented in the service team GDPR documentation
  • talk to the team and decide what other communications are necessary

P3 and P4 incidents

The incident lead should talk to the team and decide what communications are necessary.
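
The communication steps for each priority can be expressed as a small checklist builder. This is a minimal sketch under stated assumptions: `communication_checklist` is a hypothetical helper, and it only encodes the steps listed in this section. The data breach step is listed above for P1 and P2 only, so the sketch does not add it for P3 or P4.

```python
def communication_checklist(priority, data_breach=False):
    """Return the communication steps listed in this section for a priority.

    Hypothetical helper for illustration. It does not send anything.
    """
    if priority not in ("P1", "P2", "P3", "P4"):
        raise ValueError(f"unknown priority: {priority!r}")

    steps = []
    if priority == "P1":
        steps.append("Send regular updates to the organisations that offer "
                     "GovWifi in their buildings")
        steps.append("Open an incident on the GovWifi status page and add "
                     "regular updates")
    if priority in ("P1", "P2"):
        steps.append("Send quick reports of any significant events to the "
                     "portfolio-incidents email address")
        if data_breach:
            steps.append("Tell the Data Protection Officer and Cabinet Office, "
                         "following the service team GDPR documentation")
    if priority != "P1":
        # P2, P3 and P4 leave further communications to the team's judgement.
        steps.append("Talk to the team and decide what other communications "
                     "are necessary")
    return steps


if __name__ == "__main__":
    for step in communication_checklist("P2", data_breach=True):
        print("-", step)
```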

Create an incident report

The incident lead needs to create an incident report using the report template.

Throughout the incident, record actions taken, decisions and significant external communications in the report. Include timestamps.
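
One way to keep a timestamped record while the incident is in progress is sketched below. It is an illustration only and does not replace the report template. `log_entry` and `render` are hypothetical names, and the sketch assumes all timestamps are timezone-aware and recorded in UTC.

```python
from datetime import datetime, timezone

# The three kinds of entry named above.
KINDS = ("action", "decision", "communication")


def log_entry(timeline, kind, text, when=None):
    """Append a timestamped entry to `timeline` (a plain list) and return it."""
    if kind not in KINDS:
        raise ValueError(f"kind must be one of {KINDS}")
    # Assumption: entries are stamped in UTC when no time is given.
    when = when or datetime.now(timezone.utc)
    timeline.append((when, kind, text))
    return timeline


def render(timeline):
    """Return the entries in chronological order, one per line."""
    return "\n".join(
        f"{when:%Y-%m-%d %H:%M} UTC  [{kind}] {text}"
        for when, kind, text in sorted(timeline)
    )


if __name__ == "__main__":
    timeline = []
    log_entry(timeline, "action", "Summary shared on the #govwifi Slack channel")
    log_entry(timeline, "decision", "Incident lead appointed")
    print(render(timeline))
```

The rendered lines can then be pasted into the report so that the order of events is preserved.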

After the incident

When the incident is resolved, the incident lead needs to record the completed template with a unique ID in the Incident Log.

Helping other organisation incident management teams with their incidents

From time to time, GovWifi may be asked to help an organisation that is managing an incident of its own. The following principles should help make the engagement easier.

Identify the admin user for the organisation

In GovWifi admin, locate the admin users listed for the organisation in question. The organisation's incident lead is unlikely to be an admin user, so invite at least one admin user to any resolution call.

Introduce everyone involved in the incident

When speaking to the organisation’s incident team, make sure everyone has been properly introduced. The team may have a very different structure to ours and we need to understand their roles and hierarchy.

Align the goals of any incident calls/meetings

By default, GovWifi should focus on helping to resolve the issue as quickly as possible for our end users. Organisations may be inclined to understand the root cause of the incident first. We must work with the incident team to ensure our calls and meetings are outcome-focused.

Demonstrate empathy with the incident team

Emotions are likely to be running high for the incident team. This can lead to miscommunication. Keep calm and try to empathise with the incident team.

Consider the interoperability of unified communications tools

GovWifi uses Google Meet, but other organisations may not. Screen shares can be incredibly helpful to both teams, so if one team's unified communications tool does not work for the other, try to find one that works for everyone.

Be confident in your knowledge of GovWifi

Always remember you are the GovWifi subject matter expert and take confidence in that knowledge.

This page was last reviewed on 21 July 2021. It needs to be reviewed again on 21 October 2021 by the page owner #govwifi.