Resilience
This article explains how we plan our network to provide a reliable internet connection to broadband customers. It also covers some of our future plans and explains how you can make your own internet connection more resilient.
What you can do to ensure resilience
Whatever we do to make the service reliable, it is sadly impossible to make any service 100% reliable. However, there are things you can do to help and to give yourself extra peace of mind.
- One of the most likely causes of an outage of any length is a problem with your line. Simply installing a second line from us will greatly reduce the chances of a failure. We offer an extra line service at little extra cost.
- Given that a carrier, such as BT, can have issues from time to time, making the second line one using another carrier (e.g. Talk Talk wholesale) can add extra resilience whilst still using our Internet connectivity.
- Another possibility is a simple broadband line from another provider. It may not have the fixed IP, low latency and high quality of our service, but cheap broadband from a consumer ISP may be an ideal back-up in case of problems. Bear in mind that another broadband line will most likely use the same cable bundles as the service from us, so a digger cutting something it should not will most likely take out both our lines and your back-up.
- A good final back-up is a 3G/4G dongle. This uses totally different network components to provide an internet connection. It is usually costly to use, but if internet access is business critical then it makes a useful back-up. It has the added advantage that, used with a laptop, it lets you handle a power failure as well as an internet failure.
Whatever back-up plans you put in place, it is wise to monitor them so you can be sure they are available when you need them.
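As a rough illustration of monitoring your back-ups, the Python sketch below runs a test for each connection and reports which ones are not working. The link names and the probe functions are hypothetical placeholders; in practice each probe would be whatever check suits the link (for example a ping sent out via that specific connection), and how you do that depends on your own router or operating system.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Link:
    name: str                   # hypothetical label, e.g. "backup-dongle"
    probe: Callable[[], bool]   # returns True if the link works right now


def check_links(links: List[Link]) -> Dict[str, bool]:
    """Run every probe; a probe that raises an error counts as a failed link."""
    status = {}
    for link in links:
        try:
            status[link.name] = bool(link.probe())
        except Exception:
            status[link.name] = False
    return status


def failed_links(status: Dict[str, bool]) -> List[str]:
    """Names of links that need attention, in alphabetical order."""
    return sorted(name for name, ok in status.items() if not ok)
```

Running a check like this on a schedule, and alerting on anything returned by `failed_links`, means a dead back-up is found before the day you need it.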
The key method of providing resilience in any network is eliminating single points of failure. The basic idea is that things can break, so make the network such that any one thing breaking does not totally break the service being provided.
Obviously it is possible for more than one thing to break. This leads to two other important aspects: (a) using high reliability equipment in the first place, and (b) monitoring all equipment and links so we know when something breaks. This latter point is very important, as otherwise a single failure would go unnoticed and when a second failure happens you are stuffed.
The principle has to be applied to equipment (such as routers, switches, and so on), as well as links (fibre and cable) and also supporting services such as power and air conditioning.
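The idea of hunting for single points of failure can be sketched in Python: model the network as a graph, remove one component at a time, and see whether traffic can still get from one side to the other. This is only an illustration, not a description of our actual tooling; the component names are made up, and it only considers equipment failing, not individual links, power or air conditioning.

```python
from collections import deque


def build_graph(edges):
    """Turn a list of (a, b) connections into an undirected adjacency map."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def reachable(graph, start, goal, removed=None):
    """Breadth-first search from start to goal, skipping the removed node."""
    if removed in (start, goal):
        return False
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for nxt in graph.get(node, ()):
            if nxt != removed and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def single_points_of_failure(graph, start, goal):
    """Components whose loss, on its own, cuts start off from goal."""
    candidates = set(graph) - {start, goal}
    return sorted(n for n in candidates
                  if not reachable(graph, start, goal, removed=n))
```

For example, with two routers sharing one switch, `single_points_of_failure(build_graph([("bt", "router-a"), ("bt", "router-b"), ("router-a", "switch"), ("router-b", "switch"), ("switch", "transit")]), "bt", "transit")` reports only the switch.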
The main objectives of the design are:-
- Maintenance of any equipment or link should be possible without any break in service or even dropping a packet.
- Where this is not possible, such as a PPP re-termination, maintenance should only take a few seconds and be scheduled overnight.
- Failure of any link or equipment should cause no break in service.
- In practice any failure is likely to cause a break of a few seconds, but this should be minimised.
- Where this is not possible, such as re-termination of PPP sessions, reconnection should be automatic and only take a few seconds.
Sadly there will always be some eventuality one cannot plan for - the most difficult is when equipment partly breaks, for example a router that has BGP sessions and VRRP sessions working but is not forwarding packets. In cases like this staff can use remote power control to turn off the faulty equipment quickly and allow the fall-back systems to work.
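To show why the partly broken case is awkward, here is a small Python sketch that classifies a router from three observations. The field names and the state labels are invented for this example, and the forwarding test is assumed to be some end-to-end check through the router; it is not a description of how our monitoring is actually built.

```python
from dataclasses import dataclass


@dataclass
class RouterHealth:
    bgp_up: bool       # BGP sessions established
    vrrp_up: bool      # VRRP sessions working
    forwarding: bool   # an end-to-end test through the router succeeds


def classify(h: RouterHealth) -> str:
    attracting_traffic = h.bgp_up or h.vrrp_up
    if not attracting_traffic:
        # Neighbours can see it has gone, so fall-back happens on its own.
        return "offline"
    if not h.forwarding:
        # Looks alive to its neighbours but drops what it is sent.
        return "black-holing"
    if h.bgp_up and h.vrrp_up:
        return "healthy"
    return "degraded"


def needs_power_off(h: RouterHealth) -> bool:
    """Flag the case where turning the router off lets fall-back take over."""
    return classify(h) == "black-holing"
```

The point of the sketch is that "offline" is the easy case, because everything else notices. It is the "black-holing" state that needs a separate forwarding test to detect, and someone (or something) to turn the equipment off.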
We are working on systems to allow PPP connections to move seamlessly to a backup termination. This is tricky, but we have had some success in tests. This is part of the R&D work we are doing to try and make fall-back and maintenance as seamless as possible.
There is also a consideration for disaster. This is where all of your efforts for avoiding single points of failure are thwarted by some event that breaks everything. Such events include fire, bomb, theft or any other disaster that takes out a data centre where all of the equipment is housed. They can also include less drastic, but equally serious, events such as a provider (e.g. data centre) going bust or falling out with us.
Handling such eventualities is very difficult. The only way to address such issues is to spread equipment around. The problem is that you can end up paying twice for everything, making costs impossible for a competitive service. We already have some degree of diversity by utilising a number of data centres for equipment, and obviously regular off site backups. Our core control database, for example, is backed up as well as constantly replicated to machines at many different sites.
However, we are currently very reliant on links to BT at one physical location in a London data centre (well, one for 20CN and one for 21CN lines). The plan is, working with BT, to have a secondary link in a different physical location without paying for bandwidth in both locations. The key factor is not technical but commercial - we need to be paying for the bandwidth just once covering both locations. This is not in place yet, but a number of ISPs are working with BT to sort the commercial arrangements, and our longer term plan is to have a second-site connectivity recovery plan in place some time this year.
Dual link to BT
The links we have from BT which carry the broadband data to customers have redundancy built in. They use dual links.
For 20CN, at present, this means a fibre ring. One part of the ring could break and the service would continue. However this then arrives on one piece of equipment and has one link to us, where we have one piece of equipment to terminate it. The BT equipment is on 4 hour response, and we have spare termination equipment, but the nature of the link is that a failure needs manual intervention to fix things. The best we can do is fix this type of outage as quickly as possible, which we would hope would be within 1 working hour (if it is something we can fix) or 4 hours (if BT is involved). However, this year, 20CN lines are being connected to the new 21CN links to us.
For 21CN the system is much more sensible as we have two separate links to BT. These links are each big enough for all of the traffic to be carried. However (importantly) both are used and monitored constantly so we know if one link fails even though the service continues when this happens. We have had a link failure (due to the equipment at the end being faulty) during the initial trials and were able to identify the intermittent problem, switch all traffic to the working link, and wait for BT to rectify the broken link.
Also for 21CN, as the links are totally separate at our end, we have a separate piece of equipment on the end of each. This means if one of our bits of kit fails we have a fall-back in place automatically. This is much better than the 20CN arrangement.
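The requirement that each 21CN link is big enough for all of the traffic can be written as a simple check. The Python sketch below is illustrative only; the function name is made up and any figures you feed it are your own, not our real link sizes.

```python
def survives_any_single_failure(link_capacities_mbps, peak_traffic_mbps):
    """True if the remaining links can carry peak traffic after losing any one link."""
    if len(link_capacities_mbps) < 2:
        # With fewer than two links there is nothing to fall back on.
        return False
    total = sum(link_capacities_mbps)
    return all(total - capacity >= peak_traffic_mbps
               for capacity in link_capacities_mbps)
```

With two links this is the same as saying each link alone must cover the peak, which is why sharing traffic across both links in normal running must not tempt you into filling more than one link's worth of capacity.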
Power and air conditioning are something the data centres do and are pretty good at. They have large UPS systems and generators already. Whilst some of the data centres we use have dual power feeds to us, one of the crucial ones in the connectivity does not (really!). Therefore, as part of our upgrades planned for early this year (2009), we will be installing some UPS equipment in the rack. This should not be necessary in a data centre! Also, a UPS is a point of failure (and they do fail). The plan is that dual power fed equipment will have one feed from the data centre and the other via the UPS. The single feed equipment will have one of each pair on data centre power and one on the UPS. This means that if the UPS fails we still have one of everything working, and similarly if the power feed fails we have one of everything working. Unfortunately a power failure in the data centre could still affect our transit providers and mean service is impacted.
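The power arrangement can be checked with a small model. In the Python sketch below each device has a role and a set of feeds, and the function reports which roles would have nothing powered if one feed failed. The device names are invented, and the model assumes the UPS keeps running (on battery) when the data centre feed fails, which only holds for as long as the battery lasts.

```python
def roles_lost(devices, failed_feed):
    """devices maps name -> (role, set of feeds). Returns roles left with no power."""
    all_roles = {role for role, _ in devices.values()}
    powered_roles = {role for role, feeds in devices.values()
                     if feeds - {failed_feed}}
    return sorted(all_roles - powered_roles)


# Hypothetical rack layout following the plan described above.
planned = {
    "router-a": ("router", {"dc"}),         # single feed, data centre power
    "router-b": ("router", {"ups"}),        # single feed, via the UPS
    "switch-a": ("switch", {"dc", "ups"}),  # dual fed
}
```

With this layout `roles_lost(planned, "dc")` and `roles_lost(planned, "ups")` both return an empty list, whereas putting both single feed routers on the UPS would lose the router role when the UPS fails.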
We operate dual routers. We have dual L2TP routers connected to BT (for 21CN) and dual BGP routers connected to transit and peering. At present (Feb 2009) we have one router in one data centre and two in another and they work together over both of the data centres. We are in the process of moving from one data centre to another and have been for nearly a year. This was all expected to happen in a short time-scale in 2008 but delays in BT IPStream connect service on WBMC mean that we are still in the process of moving. As such we have a number of planned upgrades. At present these three routers work as a pair in each data centre and either can fail or be shut down for administration as required without any noticeable impact on service.
The longer term plan over the coming months is dual FB6000 BGP routers in the data centre. Introducing these is something that will be done carefully, one step at a time over some weeks.
Dual links to transit
We have dual links to transit. Well, strictly speaking, we have dual transit providers and dual links to those transit providers. We also have multiple peering connections some of which have dual links as well. These links connect to the separate BGP routers. The theory is that we have resilience for failure of links or transit or peering. A failure like this can mean a few seconds of disrupted routing while BGP catches up, but it is automatic.
Dual links for Ethernet customers
Part of our upgrade work is aimed at allowing for new Ethernet links to customers which we hope to be launching around April 2009. This will again mean dual routers at our end and eventually dual links to BT as well. The plan is to separate the Ethernet services from our broadband services as much as possible to allow broadband to be a sensible fall-back for Ethernet links.
We use dual switches to connect equipment. The dual links connect to the dual routers and dual transit feeds using the dual switches. The idea is that a switch failing will mean that one link, router and transit will all go off line and everything will work on the other. At present (Feb 2009) the setup is not fully automatic in the event of a switch failure, but the plan is that this will be fully automatic and tested within the next few months. As this is a live service we take the upgrades very carefully, one step at a time.