Restoring to normal

Blue screen locks out thousands

Most people had never heard of CrowdStrike until the morning of Friday, July 19, when Windows’s signature blue screens appeared on computers, locking millions of devices and disrupting daily life worldwide. This wasn’t a typical outage; nothing like this had ever happened before on a global scale, and little information was available on how to fix it.

CrowdStrike’s security software, which monitors computers to protect against cyberattacks and malware, malfunctioned after a faulty update. The update disrupted Microsoft Windows systems worldwide and blue-screened over 3,000 Yale computers, locking out their users.

An IT Major Incident Task Force was assembled at 1:45 that morning. It included approximately 40 colleagues from Information Technology Services (ITS), distributed campus IT partners, professional schools, libraries, Emergency Management, Public Affairs and Communications, and Human Resources. For the following eight days, regular routines and plans were put aside as the team worked collaboratively to diagnose, communicate, schedule, and restore Yale’s Windows environment.

A full team effort

Within the first hours of the incident, the task force realized that two distinct groups of machines were affected: Yale’s servers and its workstations. Many of Yale’s most critical systems run on Windows servers. Windows server teams within ITS and distributed campus partners immediately focused on Yale’s health, life, and safety systems and on lessening disruptions to patient care and research. They worked to restore hundreds of Yale’s enterprise systems from recent backups before CrowdStrike had distributed information on how to fix the problem. All Yale servers were restored by the following day, a truly impressive effort and a testament to the work technical teams had already done to make Yale’s systems redundant and resilient.

“I am immensely proud of the progress in our resiliency built over the last several years that limited our damage, as well as the incredible teamwork and creativity of everyone involved in getting us fully back online,” said Vice President for Information Technology and Campus Services John Barden.

The next priority was to get the remaining Windows computers that had received the bad update up and running. However, there was a catch: the only way to get machines out of the locked state was to fix them manually.

Teams assisted individuals in Connecticut who could bring their laptops to campus. Schools and departments offered space to stand up over a dozen walk-in locations. Numerous staff volunteered to work extended hours throughout the weekend to restore as many laptops as possible. Over 1,000 faculty and staff took advantage of the walk-in support available and were happy to see their blue screens disappear.

“The response to the CrowdStrike event was amazing,” said Karen Roberts, associate director for Yale Medicine finance. “The way Yale IT organized so quickly was impressive. I had the blue screen, and went to 25 Science Park at 8:30 Saturday morning. I was in and out and much appreciative of the team and the support provided.”

One question remained: With no remote solution offered by CrowdStrike, and Yale faculty and staff as far away as Africa, how could ITS help individuals who had no option for hands-on tech support? Within two days, a cross-functional team designed a way to fix affected Yale machines remotely by walking faculty and staff through the process, a welcome relief to those near and far.

Back online

Nearly 100 support staff from across the university handled calls, staffed pop-up walk-in centers, and made desk-side visits to members of the Yale community. “The CrowdStrike issue in July was an unprecedented global incident that impacted travel, major service providers, corporations, healthcare systems, and research institutions like our own,” said Barden. “I am grateful for the patience of our community while IT practitioners across the university worked to fully restore services.”