The prospect of being hit by a massive distributed denial-of-service (DDoS) attack has put individual organizations and DNS hosting providers on edge. With the cost of unplanned data center outages having increased 38 percent in five years to nearly $9,000 a minute, or about half a million dollars an hour, the concern is understandable.
That being said, not all network outages are instigated by malicious outside forces. Recently, Southwest Airlines experienced a router failure that caused the cancellation of about 2,300 flights in four days. The router took down several Southwest Airlines systems, and the outage lasted about 12 hours because the backup systems did not work as expected. Software can cause outages as well. Last year, the New York Stock Exchange experienced an outage caused by a software update that crippled the exchange platform. It was the longest technology-related disruption in recent memory.
"Your network and devices should be continuously tested as part of standard IT operations"
An outage can strike anywhere, and even a minor internal problem can ripple into widespread disruption, costing an organization money and consumer trust. How can we mitigate that risk? Some vulnerabilities can be addressed by adding layers of protection, others through network architecture decisions, and others with better testing.
Where Internal Failures Happen
Single Points of Failure: Without a thoughtful architecture, serial inline deployments of security monitoring systems, in which traffic passes from one security appliance to the next, can cause serious problems if one appliance fails. When an appliance does fail, as the router did in the incident above, the entire traffic path can fail with it, resulting in a catastrophic network outage. In other words, single devices placed inline without backup paths can have a huge impact on business operations. Fixes are also hard to implement once a failure occurs, because identifying and isolating the faulty device quickly is often difficult.
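To make the serial-chain risk concrete, here is a minimal sketch of the standard availability arithmetic. It assumes device failures are independent, and the 99.9 percent figure and the four-appliance chain are hypothetical values chosen only for illustration.

```python
from math import prod

MINUTES_PER_YEAR = 365 * 24 * 60


def serial_availability(availabilities):
    """Availability of a chain in which every device must be up."""
    return prod(availabilities)


def redundant_availability(availabilities):
    """Availability of a group in which any one working path is enough."""
    return 1 - prod(1 - a for a in availabilities)


def downtime_minutes_per_year(availability):
    """Expected minutes of downtime per year for a given availability."""
    return (1 - availability) * MINUTES_PER_YEAR


# Hypothetical: four inline appliances, each available 99.9% of the time.
chain = [0.999] * 4
inline_only = serial_availability(chain)

# Same chain, but each appliance has an independent backup path that is
# assumed (optimistically) to be just as available as the appliance.
with_backup = serial_availability(
    [redundant_availability([a, a]) for a in chain]
)

print(f"inline only: {downtime_minutes_per_year(inline_only):.0f} min/year")
print(f"with backup: {downtime_minutes_per_year(with_backup):.1f} min/year")
```

Under these assumptions the serial chain is noticeably less available than any single appliance in it, while a backup path per appliance cuts expected downtime by orders of magnitude. Real failures are rarely fully independent, so treat the numbers as an upper bound on the benefit.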
Network Blind Spots: As network segments proliferate across multiple clouds, SaaS applications, and multiple sites, and as IoT adoption grows within the enterprise, the risk of blind spots increases. Some devices go unnoticed when they connect to the network because many lack monitoring agents and are hard to track. Further, the sheer number of devices joining established infrastructures increases the risk that some of them are missed.
The Human Element: We all understand that technology is only as effective as the humans operating it, and humans are prone to err. Last year, Avaya found that 81 percent of IT pros said human error (e.g., configuration mistakes) had taken them offline.
Testing End-to-End: A typical enterprise can be running network security, endpoint security, application security, and security management from different vendors. Each of these devices requires regular maintenance and software updates. Even if every security device works exactly as it should, making sure they all work properly, individually and together, requires continuous testing.
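One simple way to surface blind spots is to compare what is actually seen on the network against the asset inventory. The sketch below assumes you already have both lists as MAC addresses; the function name, the sample addresses, and the data sources are hypothetical.

```python
def find_blind_spots(observed, inventory):
    """Compare observed MAC addresses against the asset inventory.

    Returns devices seen on the network but not inventoried ("unmanaged")
    and inventoried devices that were not seen ("missing").
    """
    observed_set = {mac.lower() for mac in observed}
    inventory_set = {mac.lower() for mac in inventory}
    return {
        "unmanaged": sorted(observed_set - inventory_set),
        "missing": sorted(inventory_set - observed_set),
    }


# Hypothetical sample data, e.g. from switch tables and an asset database.
observed = ["00:1A:2B:3C:4D:5E", "00:1a:2b:3c:4d:5f", "aa:bb:cc:dd:ee:01"]
inventory = ["00:1a:2b:3c:4d:5e", "00:1a:2b:3c:4d:5f", "aa:bb:cc:dd:ee:02"]

report = find_blind_spots(observed, inventory)
print("Unmanaged devices:", report["unmanaged"])
print("Inventoried but not seen:", report["missing"])
```

Anything in the "unmanaged" list is a candidate blind spot worth investigating. Devices in the "missing" list may simply be powered off, or may sit on a segment that is not being observed at all.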
Taking Steps to Prevent the Problem
The first step is to accept the reality that you cannot prevent all network outages, so it is critical to develop a disaster recovery and business continuity plan. It starts with having a good network plan. Be specific about which risks are acceptable and which are not, for better prioritization.
Second, it is important to include safety systems in your network design. External bypass switches paired with your inline security devices act as a fail-safe, protecting the network when an inline tool goes down.
Depending on the size of the enterprise and the number of security, monitoring, analytics, and compliance tools you are running, you may be losing data in the mirroring process. Using dedicated network taps instead of SPAN ports, with a Network Packet Broker (NPB) for data distribution, creates a much more stable network architecture. Better visibility reduces downtime and troubleshooting time.
Finally, once your network is architected, all that is left is to test, train, and test more. Your network and devices should be continuously tested as part of standard IT operations. This should be done with loads that reflect real-world scenarios, particularly when applications or the network change.
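The fail-safe behavior of a bypass switch can be sketched as a small state machine. This assumes the switch checks tool health with periodic heartbeats sent through the inline tool, which is a common design but not something described above; the class name, the threshold of three missed heartbeats, and the automatic return to inline mode are illustrative assumptions, not any vendor's implementation.

```python
from dataclasses import dataclass


@dataclass
class BypassSwitch:
    """Toy model of an external bypass switch guarding one inline tool."""

    max_missed: int = 3     # assumed threshold before failing over
    missed: int = 0         # consecutive heartbeats that did not return
    bypassed: bool = False  # True when traffic skips the inline tool

    def record_heartbeat(self, returned: bool) -> None:
        """Update state after a heartbeat is sent through the inline tool."""
        if returned:
            # Tool is healthy (or has recovered): route traffic through it.
            self.missed = 0
            self.bypassed = False
        else:
            self.missed += 1
            if self.missed >= self.max_missed:
                # Tool looks dead: send traffic around it to keep the link up.
                self.bypassed = True

    def path(self) -> str:
        return "bypass" if self.bypassed else "inline"


# Simulated heartbeat results: the tool fails, then comes back.
switch = BypassSwitch()
paths = []
for returned in [True, True, False, False, False, False, True]:
    switch.record_heartbeat(returned)
    paths.append(switch.path())
print(paths)
```

Whether traffic should keep flowing uninspected while the tool is down, or be blocked instead, is a policy decision. This sketch keeps the link up, which favors availability over inspection.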
Incidents where an IT problem rapidly becomes a major outage can happen to any business. External attacks are not always in your control. Increasing your resilience, maximizing your troubleshooting insight, and minimizing recovery time are much more in your control. Prepare your network with more resilience, failover planning, and realistic testing. React to your network outages before they happen.