How to communicate an unplanned system outage

February 8, 2018

Unanticipated problems may cause a Yale application or service to become unavailable. When that happens, it is important to communicate quickly.

Always:

  • Immediately inform and work with the Major Incident Manager: KB0003747
  • Alert the ITS Help Desk
  • Create a KB article and link it to the Incident or Problem ticket. For how to create a knowledge base article, see KB0001798

Full, partial, intermittent, performance issues

It is important to identify the outage type (full, partial, intermittent, or performance issue) when sending the wording for the Status Page update, but the outage type does not affect whether the Incident is a P1. Keep in mind that the categorization of the Incident is based on potential impact and not on actual impact. For example, a loss of access to the internet from campus is a P1 whether it happens during work hours or off shift. Please follow the ServiceNow process/procedures found in KB0003747.

Impactful outage

  • When a confirmed outage impacts a large population, or the service is expected to be unavailable for an unknown or extended amount of time, a communication may need to be sent to affected individuals or the entire campus.
  • Submit your request to message@yale.edu with the words URGENT EMAIL REQUEST in the subject of the message. Please provide as much detail as possible.
  • The request must be approved by IT Leadership, Internal Communications, the Major Incident Manager, and the Service Owner (additional approvals may be required).
  • If the Major Incident Manager, working with the Service Offering Manager, feels that an issue cannot be resolved within a reasonable amount of time and the normal communication vehicles (email/status page) are not available, you may request that the Emergency Management Team send a Yale Alert notification.
  • Send system status updates to message@yale.edu on a regular basis to keep the community informed of progress and when the issue has been resolved.
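
The request steps above can be sketched in code. This is only an illustration, not an official Yale tool: the function names, the sender address, and the text after "URGENT EMAIL REQUEST" in the subject are assumptions. Only the recipient address, the required subject words, and the four approvers come from this article. The sketch builds the message but does not send it.

```python
from email.message import EmailMessage

# Approvers named in this article; additional approvals may be required.
REQUIRED_APPROVERS = {
    "IT Leadership",
    "Internal Communications",
    "Major Incident Manager",
    "Service Owner",
}


def missing_approvals(approved: set[str]) -> set[str]:
    """Return the required approvers who have not yet signed off."""
    return REQUIRED_APPROVERS - approved


def build_urgent_request(sender: str, service: str, details: str) -> EmailMessage:
    """Build (but do not send) an urgent communication request.

    The subject suffix after the required words is a hypothetical format.
    """
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = "message@yale.edu"
    msg["Subject"] = f"URGENT EMAIL REQUEST: {service} outage"
    msg.set_content(details)  # provide as much detail as possible
    return msg
```

For example, `missing_approvals({"IT Leadership"})` returns the three approvers still outstanding, which makes it easy to see who still needs to sign off before the communication goes out.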

Priority One/Major Incident

  • Service affected: Typically, any service with a Tier 0 DR designation (health, life, safety, foundational) including, but not limited to: Active Directory, authentication (AD, CAS, Clearpass, DUO, SHIB), Cloudflare, communications tools, firewalls, identity management, impactful APIs (PeopleHub, COA Validator, etc.), important facilities, important websites, Layer 7 gateway, load balancers, messaging, monitoring, network, Oracle Internet Directory, phone services (Blue Phones, Call Centers, Desk Phones, etc.), storage, VMware, VPN.
     
  • Criticality: Depending on the criticality of a service and the time of the month or year, a service that is otherwise not considered a Major Incident may be raised to a Situational P1. Examples that may fall into this category include services used to meet a university budget or grant submission deadline, to process weekly union payroll on the day it is due, or to support commencement, admissions/decisions, student exams, etc.
    In most cases, a Priority 1/Major Incident occurs with a service that broadly affects multiple applications and/or users, but it could also occur when only one person or function is affected at a sensitive time. Examples include the university needing to send an urgent/important email, the Transplant department being without network in their building, etc.
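
The two criteria above can be read as a simple decision rule. The following is a minimal sketch, not the official ServiceNow logic: the `Outage` class, the `classify` function, and the short `TIER0_SERVICES` set are hypothetical, and the set holds only a few of the examples listed above. The function deliberately ignores the outage type and the time of day, since categorization is based on potential impact.

```python
from dataclasses import dataclass

# Hypothetical subset of the Tier 0 examples listed above; not an official list.
TIER0_SERVICES = {"authentication", "firewalls", "network", "storage", "vpn"}


@dataclass
class Outage:
    service: str                    # name of the affected service
    sensitive_timing: bool = False  # e.g. payroll due date, grant deadline, exams


def classify(outage: Outage) -> str:
    """Return 'P1', 'Situational P1', or 'Not P1' for an outage.

    'Situational P1' only marks a candidate; the article says such a
    service *may* be raised, so a person still makes the final call.
    """
    if outage.service.lower() in TIER0_SERVICES:
        return "P1"
    if outage.sensitive_timing:
        return "Situational P1"
    return "Not P1"
```

Under these assumptions, `classify(Outage("VPN"))` returns `"P1"`, while `classify(Outage("payroll", sensitive_timing=True))` returns `"Situational P1"`.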

Recently resolved or current suspected outage

To report a suspected current outage, or to acknowledge an issue that occurred recently, post to the System Status page by emailing your request to message@yale.edu. This email account reaches a team of people (Internal Communications, ITS Help Desk, Major Incident Manager, and Data Center Operations), which ensures that these central points of contact receive the same information. This is important because all our IT communications direct students, faculty, and staff to the Status Page for updates. Additionally, Status Page postings help to minimize the number of calls made to the ITS Help Desk. For more specific information on the wording of the posting, please refer to this link.

Emergency CAB

The eCAB addresses change requests that arise outside the normal change cycle from an urgent business need, where an issue impacts a critical IT service and an immediate change is required to correct the problem. An emergency change request should always be supported by a ServiceNow Incident or Problem ticket that indicates the specifics of the issue being addressed. Only the Change Manager or the designee has the authority to approve or reject an emergency change, with the advice or recommendations of the eCAB members. Send your request to eCAB@yale.edu.