News 30th Aug 2011 - Maidenhead Fibre Break

Shortly after 1pm on 30th August 2011 there was a major fibre break in Maidenhead, affecting services to the data centre we use. The issue persisted until shortly after 10pm.

Impact

The break affected connectivity to our offices and links for Ethernet services.

  • All Ethernet customers lost service completely for the whole duration of the incident (over 9 hours).
  • Calls to our offices were disrupted for a short period - automated fallbacks meant that some calls were handled on mobiles and by off-site staff, but this caused delays in answering calls. More mobiles were set up.
  • Staff were disconnected from IRC and unable to access email for a short period.
  • Staff were unable to update status pages for a short period.

Cause

The cause was a physical fibre break somewhere near the data centre - we are trying to obtain more detailed information on this.

Repair

BT have spliced the damaged fibres - services were fully restored shortly after 10pm.

Issues

There were several issues with the handling of this incident.

  • The main one was the time taken by BT to repair the fault. These circuits have a target 5 hour fix time. We are discussing this with BT now.
  • We had issues with our emergency plans, which meant delays in confirming that the fault was BT's and in reporting it.*
  • BT had issues handling the fault report, which meant the fault was not actually logged on their system for nearly an hour.*
  • Our backup systems for office communications did not work as planned. We have emergency plans and pre-configured DSL routers, 3G dongles and mifis. However, for various silly reasons none of these worked right away, which meant some delay in setting up the equipment and getting it working.
  • Consequently, the first status page was not posted for nearly half an hour. After that there were regular status updates with all available information.

*The delays getting the report in to BT would normally have been a major factor. However, this fault did not affect just us: by the time the engineer arrived on site for the fault we reported, BT engineers were already working on the break as a result of fault reports from other data centre users that were also affected. That was lucky on this occasion, but we need to ensure that any future incident is handled better.

Actions

We have a number of key actions we are taking to improve handling of such incidents in future.

  • The most important aspect is that we are setting up a process of reviewing and testing our emergency plans each month. Some things cannot be tested, but will be read through and checked (for example, we are not going to disconnect all broadband lines every month just to test things). We are, however, going to unplug our offices briefly every month, and we will call BT regularly to confirm we have the right contact for faults and the right circuit IDs.
  • We are updating the emergency plans as well, in particular the order in which checks are done and how quickly someone goes to site to check the fibre from the far end.
  • We have changed the way we handle the fallback to DSL in the office so that it now works, and the regular testing should ensure we are not caught out again; a rough sketch of the general failover idea follows this list.
  • We plan to put together a best practice guide for customers on how best to set up redundancy for their offices.
  • We are considering a second POP to allow customers to buy two Etherflows for added redundancy.
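For illustration only - this is a generic sketch, not a description of our actual office setup or equipment - the Python script below shows one simple way a Linux router can fall back from a primary link to a backup DSL gateway: probe a test address via the primary gateway and swap the default route when several probes in a row fail. The gateway addresses, probe address and threshold are all placeholder assumptions.

    #!/usr/bin/env python3
    """Minimal WAN failover sketch for a Linux router (illustrative only).

    Probes a test address via the primary gateway and, if several probes
    in a row fail, points the default route at a backup DSL gateway,
    failing back once the primary responds again. All addresses and
    thresholds below are placeholders, not real configuration.
    """

    import subprocess
    import time

    PRIMARY_GW = "192.0.2.1"    # placeholder: primary (Ethernet) gateway
    BACKUP_GW = "198.51.100.1"  # placeholder: backup DSL router gateway
    PROBE = "192.0.2.53"        # placeholder: address known to answer pings
    FAIL_THRESHOLD = 3          # consecutive failures before failing over


    def probe_ok(gateway):
        """Return True if PROBE answers one ping routed via `gateway`."""
        # Pin the probe to the given gateway with a host route so we test
        # that path rather than whichever default route is currently active.
        subprocess.run(["ip", "route", "replace", PROBE, "via", gateway],
                       check=False)
        result = subprocess.run(["ping", "-c", "1", "-W", "2", PROBE],
                                capture_output=True)
        return result.returncode == 0


    def set_default(gateway):
        """Point the default route at `gateway`."""
        subprocess.run(["ip", "route", "replace", "default", "via", gateway],
                       check=False)


    def main():
        failures = 0
        on_backup = False
        while True:
            if probe_ok(PRIMARY_GW):
                failures = 0
                if on_backup:
                    set_default(PRIMARY_GW)  # primary is back - fail back
                    on_backup = False
            else:
                failures += 1
                if failures >= FAIL_THRESHOLD and not on_backup:
                    set_default(BACKUP_GW)   # primary looks dead - fail over
                    on_backup = True
            time.sleep(10)


    if __name__ == "__main__":
        main()

In practice, built-in failover features on the router or dynamic routing are usually more robust than a script like this; the point is simply that the fallback path has to be configured and exercised in advance, which is exactly what the monthly testing described above is intended to ensure.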
