Network Availability
In this section of the course,
we're going to talk about network availability.
Now, network availability is a measure
of how well a computer network can respond to connectivity
and performance demands that are being placed upon it.
This is usually going to be quantitatively measured
as uptime, where we count the amount of time
the network was up,
and divide that by the total amount of time
covered by the monitoring period.
For example, if I monitor the network for a full year
and I only had five minutes and 16 seconds of downtime
during that year, this would equate to an uptime of 99.999%.
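If you want to see that math worked out, here's a minimal Python sketch, with the function name and values chosen just for illustration, that computes uptime from downtime the same way:

```python
def uptime_percentage(downtime_seconds: float, period_days: float = 365) -> float:
    """Return uptime as a percentage of the monitoring period."""
    total_seconds = period_days * 24 * 60 * 60
    up_seconds = total_seconds - downtime_seconds
    return (up_seconds / total_seconds) * 100

# Five minutes and 16 seconds of downtime over a full year:
print(round(uptime_percentage(5 * 60 + 16), 3))  # -> 99.999
```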
Now this is known as the five nines of availability.
And it's considered the gold standard
in network availability.
This is considered to be an extremely available
and high quality network,
but there is going to be downtime in your networks.
It's a fact of life, devices fail, connections go down,
and incorrect configurations are sometimes going to be applied.
There are a lot of reasons that downtime occurs,
but our goal is to minimize that downtime
and increase our availability.
In order to reach the highest levels of availability,
we need to build in availability and redundancy
into our networks from the beginning.
We also are going to use quality of service
to ensure that our end-users are happy with the networks
and the services that we're providing them.
So in this section on network availability,
we're really going to focus on two domains.
We're going to see domain two,
which is network implementations,
and domain three, network operations.
In here, we're going to talk about two objectives.
Objective 2.2 and objective 3.3.
Now objective 2.2 states that you must compare and contrast
routing technologies and bandwidth management concepts.
Objective 3.3 states that you must explain high availability
and disaster recovery concepts
and summarize which is the best solution.
So let's get started talking
all about the different ways for us
to increase the availability, reliability,
and quality of service within our networks
in this section of the course.
High availability.
In this lesson,
we're going to talk all about high availability.
Now, when we're talking about high availability,
we're really talking about making sure our systems are up
and available.
Availability is going to be measured in what we call uptime,
or how many minutes or hours you're up and available,
shown as a percentage.
Usually, you're going to take the amount of minutes
you were up,
divided by the total amount of minutes in a period,
and that gives you a percentage known as uptime.
Now, we try to maintain what is known as the five nines
of availability in most commercial networks.
This is actually really hard because that's 99.999%.
That means I get a maximum of about five minutes
of downtime per year,
which is not a whole lot of downtime.
In some cloud-based networks,
they aim for six nines of availability, or 99.9999%.
This equates to just 31 seconds of downtime
each and every year.
Now, as you can imagine,
I need more than 31 seconds of downtime,
or even five minutes of downtime,
to fully patch my servers, install a new hard drive,
or put in a new router or switch when one fails.
So, how do I maintain that high level of availability?
Well, I'm going to do that,
by designing my networks to be highly available.
Now, there are two terms you need to understand
and be familiar with,
when we talk about high availability.
There is availability and reliability,
and these are different things.
When I'm talking about availability,
this is concerned with being up and operational.
When I talk about reliability,
I'm concerned with not dropping packets
inside of my network.
If your network is highly available,
but it's not reliable,
it's not a very good network
because it's dropping things all the time
and isn't doing what it's supposed to.
But conversely,
you can have a really highly reliable network,
but if it's not a highly available one,
nobody can use it either because it's down all the time.
So that wouldn't be good either.
So, let's say I had the most highly reliable network
in the entire world,
but it's only up 20 minutes a year.
That's not going to be any good, right?
So, we want to make sure we balance these two things.
We have to aim for good enough in both areas
to meet our business needs based on the available resources
and the amount of money we have to build our networks.
So, when we measure our different network components,
we have to determine how highly available they are.
And we do that through measurement of MTTR and MTBF.
Now, MTTR is the mean time to repair.
This measures the average time it takes to repair
a network device when it breaks.
After all,
everything is going to break eventually.
So, when a device breaks,
how long does it take for you to fix it?
And how much downtime are you going to experience?
That is what we're trying to measure
when we deal with the mean time to repair.
Now, the mean time between failures or MTBF,
is going to measure the average time
between when a failure happens on a device
and the next failure happens.
Now, these two terms can often be confusing.
So, let me display it on a timeline and explain a little bit
about what they look like in the real world.
Now, let's say I had a system failure
at this first stop sign here on the left side.
Then we resume normal operations because we fix things.
That amount of time was the time to repair.
Now, if I average all the times to repair
over the entire year for that type of device,
that's going to give me my MTTR,
my mean time to repair, the average time to repair.
Now, on the failure side of things,
we want to measure the time from one failure,
through us using it and fixing it,
until the next failure happens.
This becomes the time between the failures.
If I average all those together,
I get the average time between failures
or the mean time between failures, MTBF.
Hopefully, you can see the difference here.
Remember, when we're dealing with mean time to repair,
we want this to be a very small number.
When we deal with the mean time between failures,
we want this to be a very large number.
With a very small number for the mean time to repair,
it means we can fix things really quickly
and get ourselves back online.
So, the lower the mean time to repair,
the better the network availability.
Now, on the other hand,
when we start talking about mean time between failures,
we want a really long time,
because this means that the device stays up
and operational for a very long time before it fails.
This is going to give us better network availability,
and overall, it should give us better reliability too.
Now, we don't want a lot of failures here.
And so the more time in between failures,
the better that is for our network.
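As a rough sketch of how these two metrics fit together, here's a small Python example; the outage timestamps are invented purely for illustration:

```python
# Hypothetical outage log for one device: (failure_time, repair_time) pairs,
# measured in hours since the start of the monitoring period.
outages = [(100.0, 102.0), (2000.0, 2001.5), (6000.0, 6000.5)]

# MTTR: average of each repair duration (we want this to be small).
mttr = sum(repair - fail for fail, repair in outages) / len(outages)

# MTBF: average time from the start of one failure to the start of the next
# (we want this to be large).
gaps = [outages[i + 1][0] - outages[i][0] for i in range(len(outages) - 1)]
mtbf = sum(gaps) / len(gaps)

print(f"MTTR = {mttr:.2f} hours, MTBF = {mtbf:.2f} hours")
```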
So, how do we design these networks
to be highly reliable and highly available?
Well, we're going to add redundancy to our networks
and their devices.
Now, redundancy can be achieved through a single device
or by using multiple devices.
If you're using a single device,
you're still going to have single points of failure
in your network,
but it is cheaper than being fully hardware redundant.
Let's take a look at this concept for a moment.
Here you could see a single point of failure in my network.
Even though I have two switches and multiple connections
between those switches,
which gives me additional redundancy,
that router is not giving me additional redundancy.
It's a single point of failure
because it's the only router I have.
So, even if the router has internal hardware redundancy,
like two power supplies and two network cards,
I still only have one router chassis and one circuit board
running in that router.
So, if that router goes down,
this entire network is going to stop.
Therefore, this is considered a single point of failure.
Now instead,
I could redesign the network
and I can increase its redundancy by doing this.
Notice, I now have two PCs that want to talk to each other.
And each of them has dual network interface cards
talking to two different switches.
And each of those switches talks to two different routers.
Everything is connected to everything else
in a mesh topology for these network devices.
This gives me multiple connections between each device
and provides me with link redundancy, component redundancy,
and even inside those devices,
I may have two network cards, two power supplies,
and two of every other internal network component there is,
so that I have a very redundant
and highly available network.
Now, if one of those routers needs to be upgraded,
I can take it offline and update its firmware,
and then the entire time that second router
is still on the network,
maintaining the load and providing service to all the users.
Then I can put the first router back on the network,
take the second router offline, and then do its upgrades.
By doing this and taking turns,
I still am able to have network functions run,
and I have no downtime associated with this network.
This is how you keep a network highly available.
Now, let's talk a little bit more about hardware redundancy.
Inside these routers and other network devices,
we can have hardware redundancy or the devices themselves
could be hardware redundant.
Now, if I have two routers and they're both
serving the same function,
this is considered hardware redundancy.
I could also have hardware redundancy in the components
by having two network cards or two hard drives
or two internal power supplies on a single device.
That way, if one of them fails, the second one takes over.
Usually, you're going to find this
in strategic network devices,
things like your switches, your routers, your firewalls,
and your servers,
because you can't afford a failure
in any one of those devices,
because they would take down large portions
of your network or its services.
On the other hand, if I'm considering my laptop,
I only have one hard drive in it.
If that laptop fails or that hard drive fails,
I would just deal with that downtime.
I might buy a new laptop or a new hard drive
and then restore from an old backup.
That would get me back up and running.
Now, when we're working with end-user devices
like workstations and clients,
we often don't deal with redundancy.
But when you start getting to servers and routers
and switches and firewalls,
you need to start having hardware
and component level redundancy
because these serve lots of end-users.
Once we have this redundancy,
we can then cluster our devices and have them work in either
an active-active
or an active-passive configuration.
All right.
Let's assume I have this one computer
and it has two network interface cards
that are connected to the network.
Do I want to talk to both routers at the same time?
Well, if I'm active-active,
then both of those network interface cards
are going to be active at the same time,
and they each are going to have their own MAC address,
and they're going to be talking at the same time
to either of these two routers.
This can then be done to increase the bandwidth
of this computer and load balance
across both network interface cards.
This is known as Network Interface Card teaming,
or NIC teaming,
where a group of network interface cards,
is used for load balancing and failover for a server
or another device like that.
Now, on the other hand,
we can use active-passive,
and this is going to have a primary
and a backup network interface card.
Now, one of these cards is going to be active
and being used at all times.
And when it fails,
the other card is going to go from standby and take over.
In this case,
there is no performance increase from having two cards,
but I have true redundancy and failover capabilities.
In an active-passive configuration,
both NICs are going to be working together
and they're going to have a single MAC address
that they're going to display to the network,
so they look like they're a single device.
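As a conceptual sketch only (this is not a real NIC driver, and the class and interface names here are invented), you can think of the difference between the two modes like this in Python:

```python
import itertools

class NicTeam:
    """Toy model of NIC teaming: active-active load balances, active-passive fails over."""

    def __init__(self, nics, mode="active-active"):
        self.nics = nics          # e.g. ["nic0", "nic1"]
        self.failed = set()
        self.mode = mode
        self._rr = itertools.cycle(nics)

    def pick_nic(self):
        healthy = [n for n in self.nics if n not in self.failed]
        if not healthy:
            raise RuntimeError("no healthy NICs left")
        if self.mode == "active-active":
            # Spread traffic across every healthy card for more bandwidth.
            while True:
                nic = next(self._rr)
                if nic in healthy:
                    return nic
        # active-passive: always use the first healthy card; the rest stand by.
        return healthy[0]

team = NicTeam(["nic0", "nic1"], mode="active-passive")
print(team.pick_nic())      # nic0 carries all the traffic
team.failed.add("nic0")     # simulate a card failure
print(team.pick_nic())      # nic1 takes over
```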
Now, when you start looking at redundancy at layer three,
we're going to start talking about our routers.
Now here,
our clients are getting configured with a default gateway,
which is our router by default.
But, if the default gateway went down,
we wouldn't be able to leave the subnet,
and so we'd be stuck on the internal network.
Now, we don't want that.
So instead,
we want to add some redundancy
and we'll use layer three redundancy
using a virtual gateway.
To create a virtual gateway,
we need to use a First Hop Redundancy Protocol (FHRP),
such as HSRP
or the Virtual Router Redundancy Protocol (VRRP).
Now, the most commonly used First Hop Redundancy Protocol
is known as HSRP or the Hot Standby Router Protocol.
This is a layer three redundancy protocol
that's used as a proprietary First Hop Redundancy Protocol
in Cisco devices.
HSRP is going to allow for an active and a standby router
to be used together.
Together, these give us a virtual router that's defined
as our default gateway.
The client devices like the workstations and servers
are then going to be configured to use the virtual router
as their gateway.
When the PC communicates with the IP of the virtual router,
HSRP determines which physical router is active
and which one is standby,
and then it forwards the traffic to that active router.
If the active router goes down,
the standby router will pick up the responsibility
for that active router
until the other router comes back online
and takes over its job again.
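To make the idea of a virtual gateway concrete, here's a tiny simulation in Python; this is only a teaching sketch of the concept, not the actual HSRP protocol, and the router names, priorities, and IP address are made up:

```python
class FirstHopGroup:
    """Toy model of a virtual gateway shared by an active and a standby router."""

    def __init__(self, virtual_ip, routers):
        # routers: dict of name -> priority; the highest priority becomes active.
        self.virtual_ip = virtual_ip
        self.routers = dict(routers)
        self.down = set()

    def active_router(self):
        candidates = {n: p for n, p in self.routers.items() if n not in self.down}
        if not candidates:
            raise RuntimeError("no routers available for " + self.virtual_ip)
        return max(candidates, key=candidates.get)

group = FirstHopGroup("10.0.0.1", {"RouterA": 110, "RouterB": 100})
# Clients are configured with 10.0.0.1 as their default gateway.
print(group.active_router())   # RouterA forwards traffic for the virtual IP
group.down.add("RouterA")      # RouterA fails
print(group.active_router())   # RouterB takes over; clients never reconfigure
```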
Now, VRRP,
the Virtual Router Redundancy Protocol,
is one that was created
by the Internet Engineering Task Force.
It's an open standard variant
of the Hot Standby Router Protocol, or HSRP.
VRRP allows for one master or active router,
and the rest can then be added in a cluster as backups.
Unlike HSRP,
where you can only have one router as active
and one as standby,
VRRP is going to allow you to have multiple standby routers.
Just like HSRP,
you're going to configure the VRRP
to create a virtual router
that's going to be used as a default gateway
for all of your client devices.
Now, in order to provide load balancing on your networks
and to increase both redundancy
and performance of your networks,
you can use GLBP,
which is the Gateway Load Balancing Protocol,
or you can use LACP, the Link Aggregation Control Protocol.
Now, GLBP or the Gateway Load Balancing Protocol
is a Cisco protocol,
and it's another proprietary First Hop Redundancy Protocol.
Now, GLBP will allow us to create a virtual router
and that virtual router will have two routers
being placed behind it,
in active and standby configuration.
The virtual router or gateway will then forward traffic
to the active or standby router
based on which one has the lower current loading,
when the gateway receives that traffic.
If both can support the loading,
then the GLBP will send it to the active
since it's considered the primary device.
By using GLBP,
you can increase the speeds of your network
by using load balancing between two routers or gateways.
Now, the second thing we can use
is LACP, or Link Aggregation Control Protocol.
This is a redundancy protocol that's used at layer two.
So we're going to be using this with switches.
LACP is going to achieve redundancy by having multiple links
between the network devices,
where load balancing over multiple links can occur.
The links are all going to be considered
part of a single combined link,
even though we have multiple physical connections.
This gives us higher speeds and increases our bandwidth.
For example,
let's pretend I have four Cat5 cables.
Each of these is connected to the same switch.
Now, each of those cables has 100 megabits per second
of bandwidth.
Now, if I use the Link Aggregation Control Protocol,
I can bind these all together and aggregate them
to give me 400 megabits per second
of continuous bandwidth by creating
one single combined link
from those four connections.
Now, let's consider what happens
when that traffic is trying to leave our default gateway
and get out to the internet.
Now in your home,
you probably only have one internet connection,
but for a company,
you may wish to have redundant paths.
For example, at my office,
we have three different internet connections.
The first is a microwave link that operates
at 215 megabits per second for uploads and downloads.
The second, is a cable modem connection.
It operates at 300 megabits per second for downloads
and 30 megabits per second for uploads.
Now the third is a cellular modem,
and that gives me about 100 megabits per second
for downloads and about 30 megabits per second for uploads.
Now, the reason I have multiple connections,
is to provide us with increased speed and redundancy.
So, to achieve this,
I take all three of these connections
and connect them to a single gateway
that's going to act as a load balancer.
If all the connections are up and running,
they're going to load balance my traffic
across all three of those connections
to give me the highest speeds at any given time.
But, if one of those connections drops,
the load balancer will remove it from the pool
until it can be returned to service.
By doing this,
I can get a maximum speed of about 615 megabits per second
for a combined download.
And on the upload,
I can get about 310 megabits per second,
when using all three connections
and they're all functioning and online.
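Here's a small sketch of that pooling logic in Python; the link names and download speeds mirror the example above, but the code itself is only illustrative, not a real load balancer:

```python
# Download capacity of each WAN link, in megabits per second (from the example above).
links = {
    "microwave": {"down_mbps": 215, "online": True},
    "cable":     {"down_mbps": 300, "online": True},
    "cellular":  {"down_mbps": 100, "online": True},
}

def pooled_download(links):
    """Total download bandwidth across every link that's currently in the pool."""
    return sum(l["down_mbps"] for l in links.values() if l["online"])

print(pooled_download(links))        # 615 with all three links up
links["cable"]["online"] = False     # the cable modem drops, so it leaves the pool
print(pooled_download(links))        # 315 until the cable link returns to service
```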
Similarly,
you might be in an area where you can get fiber connections
to your building.
Now, in those cases,
you may purchase a primary and a backup connection.
And if you do,
you should buy them from two different providers.
If both of your connections are coming from the same company
and that company goes down,
well, guess what?
You just lost both of your connections
because the upstream ISP is down.
For this reason,
it's always important to have diversity in your path
when you're creating link redundancy,
just like I did in my office.
I have a microwave connection through one ISP.
I have a cable modem through another ISP,
and I have a cellular modem through a third ISP.
That way,
if any one of them goes down,
I still have two other paths I can use.
Now, the final type of redundancy that we need to discuss,
is known as multipathing.
Multipathing is used in our storage area networks.
Multipathing is used to create more than one physical path
between the server and its storage devices.
And this allows for better fault tolerance
and performance enhancements.
Basically, think of multipathing
as a form of link aggregation,
but instead of using it for switches,
we're going to use it for our storage area networks.
In the last lesson, I showed you a couple of diagrams
of redundant networks,
but one of the things we have to think about in this lesson
is the considerations we face
when we start designing these redundant networks.
First you need to ask yourself,
are you going to use redundancy in the network,
and if so, where, and how?
So are you going to do it from a module or a parts perspective?
For instance, are you going to have multiple power supplies,
multiple network interface devices, multiple hard drives,
or are you going to look at it more from a chassis redundancy
and have two sets of routers or two sets of switches?
These are things you have to think about.
Which one of these are you going to use,
because each one is going to affect the cost
of your network, based on the decisions you make.
You have to be able to make a good business case
for which one you're going to use, and why.
For instance, if you could just have
a second network interface card or a second power supply,
that's going to be a lot cheaper
than having to have an entire switch
or an entire extra router there.
Now, each of those switches and routers,
some of these can cost $3,000, $4,000, or $5,000,
and so it might be a lot cheaper
to have a redundant power supply, right,
and so these are the things you have to think about
and weigh as you're building your networks.
Another thing you have to think about
is software redundancy,
and which of its features are going to be appropriate.
Sometimes you can solve a lot of these redundancy problems
by using software as opposed to hardware.
For example, if you have a virtual network setup,
you could just put in a virtual switch
or a virtual router in there,
and that way you don't have to bring
another real router or real switch in,
that can save you a lot of money.
There's also a lot of other software solutions out there,
like a software RAID,
that will give you additional redundancy
for your storage devices,
as opposed to putting in an extra hard drive chassis,
or another RAID array or storage area network.
Also, these are the types of things
you have to be thinking about
as you're building out your network, right?
When you think about your protocols,
what protocol characteristics
are going to affect your design requirements?
This is really important if you're designing things,
and you're using something like
TCP versus UDP in your designs,
because TCP has that additional redundancy
by resending packets, where UDP doesn't.
This is something you have to consider as well.
As you design all these different things,
all of these different factors are going to work together,
just like gears: each one turns another,
and each one feeds another,
and you get more reliability and availability
in your networks
by adding all these components together.
In addition to all this,
there are other design considerations
that we have to think about as well,
like what redundancy features should we use
in terms of powering the infrastructure devices?
Are we going to have internal power supplies
and have two of those, and have them redundant?
Or, are we going to have battery backups or UPSs,
or are we going to have generators?
All of these things are things you have to think about,
and I don't have necessarily the right answers for you,
because it all comes down to a case-by-case basis.
Every network is going to be different,
and every one has its own needs
and its own business case associated with it.
The networks that I had at former employers
were serving hundreds of thousands of clients,
and those were vastly different from the ones
that are servicing my training company right now,
with just a handful of employees.
That's because when you're dealing with your network design
and your redundancies,
you have to think about the business case first.
Each one is going to be different
based on your needs and your considerations.
What redundancy features should be used
to maintain the environmental conditions of your space?
When you think about power and space and cooling,
you need to make sure
that you're thinking about air conditioning,
and do you have one unit or two?
Do you have generators onsite?
Do you have additional thermal heating or thermal cooling?
All of these things are things you have to think about.
What do you do when power goes down?
What are some of those things
that you're going to have to deal with
if you're running a server farm
that has to have units running all the time,
because it can't afford to go down
because it's going to affect
thousands and thousands of people,
instead of just your one office with 20 people?
All of these are things you have to consider
as you think about it.
In my office, we made the decision
that one air conditioning unit was enough,
because if it goes down, we might just not work today
and we'll come to work tomorrow, we can get over that.
But in a server farm,
we need to make sure we have multiple air conditioners,
because if that goes down
it can actually burn up all the components, right?
So we have to have additional power and space and cooling
that are fully redundant,
because of that server infrastructure
that we're supporting there.
These are the things you have to balance in your practices.
And so when you start looking at the best practices,
I want you to examine your technical goals
and your operational goals.
Now what I mean by that is,
what is the function of this network?
What are you actually trying to accomplish?
Are you trying to get to 90% uptime, or 95%, or 99%,
or are you going for that gold standard
of five nines of availability?
Every company has a different technical goal,
and that technical goal is going to determine
the design of your network.
And you need to identify that
inside of your budgeting as well,
because funding these high-availability features
is really expensive.
As I said, if I want to put a second router in there,
that might cost me another $3,000 or $5,000.
In my own personal network,
we have a file server, and it's a small NAS device.
We decided we weren't comfortable
having all of our file storage on a single hard drive,
so we built this NAS array instead.
That way, if one of those drives goes out,
we have three others that are carrying the load.
This is the idea here.
Now, eventually we decided we didn't need that NAS anymore,
and so we replaced that NAS enclosure with a full RAID 5.
Later on we took that full RAID 5
and we switched it over to a cloud server
that has redundant backups
in two different cloud environments.
And so all of these things work together
based on our decisions,
but as we moved up that scale
and got more and more redundancy,
we have more and more costs associated.
It was a lot cheaper just to have an 8-terabyte hard drive
with all of our files on it,
then we went to a NAS array
and that cost two or three times that money,
then we went to a full RAID 5
and that cost a couple more times that,
then we went to the cloud and we have to pay more for that.
Remember, all your decisions here
are going to cost you more money,
but if it's worth it to you, that would be important, right,
and so these are the things you have to balance
as you're designing these fully redundant networks,
based on those technical goals.
You also need to categorize
all of your business applications into profiles,
to help with this redundancy mission
that you're trying to go and accomplish here.
This will really help you as you start
going into the quality of service as well.
Now if I said, for instance,
that web is considered category one
and email is category two
and streaming video's going to be category three,
then we can apply profiles
and give certain levels of service
to each of those categories.
Now we'll talk specifically about how that works
when we talk about quality of service in a future lesson.
Another thing we want to do
is establish performance standards
for our high-availability networks.
What are the standards that we're going to have to have?
These standards are going to drive
how success is measured for us,
and in the case of my file server, for instance,
we measure success as it being up and available
when my video editors need to access it,
and that they don't lose data,
because if we lost all of our files,
that'd be bad for us, right?
Those are two metrics that we have,
and we have numbers associated with each of those things.
In other organizations, success is measured based on the uptime
of the entire end-to-end service.
So, for an ISP, if a client can't get out to the internet,
that would be a bad thing, and that's one of their measurements.
Another one might be their overall uptime.
All of these performance standards are developed
through metrics and key performance indicators.
If you're using something like ITIL
as your IT service management standards,
this is what you're going to be doing as you're trying
to run those inside your organization as well.
Finally, we want to define how we manage and measure
the high-availability solutions for ourselves.
Metrics are going to be really useful to quantify success,
if you develop those metrics correctly.
Decision-makers and leaders love seeing metrics.
They love seeing charts and seeing the performance,
and how it's going up over time,
and how our availability is going up,
and how our costs are going down.
Those are all good things,
but if you don't know what you're measuring
or why you're measuring it,
and it doesn't tie back to your performance standards,
then these are the kinds of metrics
that are wasting your time.
A lot of people measure a lot of things
that don't really tell you
if you're getting the outcome you want.
I want to make sure that you think about
how you decide on what metrics you're going to use.
Now, we've covered a lot of different design criteria
in this lesson, but the real big takeaway here
that I want you to think about is this.
If you have an existing network,
you can add availability to it,
and you can add redundancy to it.
You can retrofit stuff in,
but it's going to cost you a lot more time
and a lot more money.
It is much, much cheaper
to design this stuff early in the process
when you start building a network from scratch.
So, if you're designing a network and you're asked early on
what kind of things you need,
I want you to think about all these things of redundancy
in your initial design.
Adding them in early is going to save you a lot of money.
Every project has three main factors,
time, cost, and quality,
and usually, one of these things is going to suffer
at the expense of the other two.
For example, if I asked you to build me a network
and I want it to be fully redundant
and available by tomorrow, could you do it?
Well, maybe, but it's probably going to cost me a lot of money,
and because I'm giving you very little time,
it's going to cost me even more,
or your quality is going to suffer.
So, you could do it good, you could do it quick,
or you could do it cheap, but you can't do all three.
It's always going to be a trade-off between these three things,
and I want you to remember
as you're out there and you're designing networks,
you need to make sure you're thinking about your redundancy
and your availability and your reliability,
because often that quality is going to suffer
in favor of getting things out quicker
or getting things out cheaper.
Recovery sites.
In this lesson, we're going to discuss the concept
of recovery sites.
After all, things are going to break and your networks
are going to go down.
This is just a fact of life.
So what are you going to do when it comes time
to recover your enterprise network?
Well, that's what we're going to discuss in this lesson.
When it comes to designing redundant operations
for your company,
you really should consider a recovery site.
And with recovery sites, you have four options.
You see, you can have all the software and hardware
redundancy you want.
But at the end of the day,
sometimes you need to actually recover your site too.
Now this could be because there's a fire that breaks out
in your building or a hurricane or earthquake.
All of these things might require you to relocate
and if you do, you're going to have to choose
one of four options.
This could be a cold site, a warm site, a hot site
or a cloud site.
Now when we deal with cold sites,
this means that you have a building that's available
for you to use,
but you don't have any hardware or software in place.
And if you do, those things aren't even configured.
So you may have to go out to the store and buy routers
and switches and laptops and servers
and all that kind of stuff.
You're going to bring it to a new building, configure it
and then restore your network.
This means that while recovery is possible,
it's going to be slow and it's going to be time consuming.
If I have to build you out a new network in a cold site,
that means I'm going to need you to bring everything in
after the bad thing has already happened,
such as your building catching fire.
And this can take me weeks or even months
to get you fully back up and running.
Now, the biggest benefit of using a cold site
is that it is the cheapest option
that we're going to talk about.
The drawbacks are that it is slow and essentially
this is just going to be an empty building
that's waiting for you to move in and start rebuilding.
Now next, we have a warm site.
A warm site means you have the building available
and it already contains a lot of the equipment.
You might not have all your software installed
on these servers or maybe you don't have the latest security
patches or even the data backups from your other site
haven't been recovered here yet.
But you do already have the hardware
and the cabling in place.
With a warm site,
we already have a network that's running in the facility.
We have switches and routers and firewalls.
But we may not maintain it fully
each and every day of the year.
So, when a bad event happens
and you need to move into the warm site,
we can load up our configurations on our routers
and switches, install the operating systems on the servers,
restore the files from backup
and usually within a couple of days,
we can get you back up and running.
Normally with a warm site,
we're looking at a restoral time of between 24 hours
and seven days.
Basically, under a week.
Recovery here is going to be fairly quick,
but not everything from the original site
is going to be there and ready for all employees
at all times.
Now, if speed of recovery is really important to you,
the next type of site is your best choice.
It's known as a hot site.
Now, a hot site is my personal favorite.
But it's also the most expensive to operate.
With a hot site, you have a building, you have the equipment
and you have the data already on site.
That means everything in the hot site is up and running
all the time.
Ready for you to instantly switch over your operations
from your primary site to your hot site
at the flip of a switch.
This means you need to have the system and network
administrators working at that hot site every day
of the year, keeping it up and running, secured
and patched and ready for us to take over operations
whenever we're told to.
Basically, your people are going to walk out of the old site,
get in their car, drive to the new site, login
and they're back to work as if nothing ever happened.
This is great because there's very minimal downtime.
And you're going to have nearly identical sets of servers
at the main site and in the hot site.
But as you can imagine, this costs a lot of money.
Because I have to pay for the building,
two sets of equipment, two sets of software licenses
and all the people to run all this stuff.
You're basically running two sites at all times.
Therefore, a hot site gets really expensive.
Now a hot site is very critical
if you're in a high availability type of situation.
Let's say you work for a credit card processing company.
And every minute they're down costs them millions of dollars.
They would want to have a hot site, right?
They don't want to be down for three or four weeks.
So they have to make sure they have their network up
and available at all times.
Same thing if you're working for the government
or the military,
they always need to make sure they're operating,
because otherwise people could die.
And so they want to make sure everything is always up and running.
That's where hot sites are used.
Now, if you can get away from those types of criticality
requirements though, which most organizations can,
you're going to end up settling on something like a warm site,
because it's going to save you on the cost of running
that full recovery hot site.
Now the fourth type of site we have
is known as a cloud site.
Now a cloud site isn't exactly a full recovery site,
like a cold, warm, or hot site is.
In fact, there may be no building for you to move
your operations into.
Instead, a cloud site is a virtual recovery site
that allows you to create a recovery version
of your organization's network in the cloud.
Then if disaster strikes, you can shift all your employees
to telework operations by accessing that cloud site.
Or you can combine that cloud site with a cold or warm site.
This allows you to have a single set of system
administrators and network administrators
that run your day to day operational networks
and they can also run your backup cloud site.
That's because they can operate it all
from wherever they're sitting in the world.
Now cloud sites are a good option to use,
but you are going to be paying a cloud service provider
for all the compute time, the storage
and the network access required to use that cloud site
before, during and after the disastrous event.
So, which of these four options should you consider?
Well, that really depends on your organization,
its recovery time objective, the RTO,
and its recovery point objective, the RPO.
Now the recovery time objective or RTO
is the duration of time and service level
within which a business process has to be restored
after a disaster happens in order to avoid unacceptable
consequences associated with a break in continuity.
In other words, our RTO is going to answer our question,
how much time did it take for the recovery to happen
after the notification of a business process disruption?
So, if you have a very low RTO,
then you're going to have to use either a hot site
or a cloud site because you need to get up and running
quickly.
That is the idea of a low RTO.
Now on the other hand, we have to think about our RPO.
Which is our recovery point objective.
Now RPO is going to be the interval of time that might pass
during the disruption before the quantity of data loss
during that period exceeds the business continuity plan's
maximum allowable threshold or tolerance.
Now RPO is going to determine the amount of data
that will be lost or will have to be re-entered
because of downtime in network operations.
It symbolizes the amount of data that can be acceptably lost
by the organization.
For example, in my company we have an RPO of 24 hours.
That means if all of our servers crashed right now,
I as the CEO have accepted the fact that I can lose no more
than the last 24 hours worth of data and that would be okay.
To achieve this RPO,
I have daily backups that are conducted every 24 hours.
So, we can ensure we always have our data backed up
and ready for restoral at any time.
And that means we will lose at most 24 hours worth of data.
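As a simple illustration of that idea (the file timestamps, the pretend "now," and the 24-hour threshold are all made up to match this example), a script could compare the age of the newest backup against the RPO:

```python
from datetime import datetime, timedelta

RPO = timedelta(hours=24)  # maximum acceptable data loss for this organization

# Hypothetical timestamps of the most recent completed backups.
last_backups = [
    datetime(2023, 5, 1, 2, 0),
    datetime(2023, 5, 2, 2, 0),
]

newest = max(last_backups)
age = datetime(2023, 5, 2, 14, 30) - newest  # pretend "now" is 2:30 PM on May 2nd

if age <= RPO:
    print(f"OK: newest backup is {age} old, within the {RPO} RPO")
else:
    print(f"ALERT: newest backup is {age} old, exceeding the {RPO} RPO")
```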
The RTO, that recovery time objective, is going to be focused
on the real time that passes during a disruption.
Like if you took out a stopwatch and started counting.
For example, can my business survive
if we're down for 24 hours?
Sure.
It would hurt, we would lose some money, but we can do it.
How about seven days?
Yeah, again, we would lose some money,
we'd have some really angry students,
but we could still survive.
Now, what about 30 days?
No way.
Within 30 days all of my customers and students,
they would have left me.
They would take their certifications
through some other provider out there
and I would be out of business.
So I had to figure out my RTO, which is someplace between one
and seven days, to make me happy.
So that's the idea of operational risk tolerance,
we start thinking about this from an organizational level.
How much downtime are you willing to accept?
Based on my ability to accept seven days,
I could use a warm site instead of a hot site.
But if I could only accept 24 hours of downtime,
or five minutes of downtime,
then I would have to use a hot site instead.
RTO is used to designate that amount of real time
that passes on the clock before that disruption
begins to have serious and unacceptable impediments
to the flow of our normal business operations.
That is the whole concept here with RTO.
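As a rough decision sketch, here's how you might map an RTO to a recovery site type in Python; the thresholds simply echo the examples in this lesson and aren't a formal rule:

```python
def suggest_recovery_site(rto_hours: float) -> str:
    """Very rough mapping from recovery time objective to a recovery site type."""
    if rto_hours <= 1:
        return "hot site or cloud site (near-instant switchover)"
    if rto_hours <= 24 * 7:
        return "warm site (building and equipment ready, restore within days)"
    return "cold site (empty building, rebuild over weeks)"

print(suggest_recovery_site(0.5))      # very low RTO -> hot or cloud site
print(suggest_recovery_site(72))       # a few days -> warm site
print(suggest_recovery_site(24 * 30))  # a month -> a cold site may be acceptable
```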
Now when we start talking about RPO and RTO,
you're going to see this talked about a lot in backups
and recovery as well.
When you deal with backups and recovery,
you have a few different types of backups.
We have things like full backups, incremental backups,
differential backups and snapshots.
Now a full backup is just what it sounds like.
It's a complete backup of every single file on a machine.
It is the safest and most comprehensive backup method,
but it's also the most time consuming and costly.
It's going to take up the most disk space
and the most time to run.
This is normally going to be run on your servers.
Now another type of backup we have
is known as an incremental backup.
With an incremental backup, I'm going to back up the data
that changed since the last backup.
So, if I did a full backup on Sunday
and I go to do an incremental backup on Monday,
I'm only going to back up the things that have changed
since doing that full backup on Sunday.
Now another type we have is known as a differential backup.
A differential backup is going to back up all the data
that has changed since the last full backup.
So, let's go back to my example
of Sunday being a full backup
and then I did an incremental backup on Monday.
Then that backup is going to copy everything since Sunday.
But if I do an incremental on Tuesday, it's only going to do
the difference between Monday and Tuesday.
That's because Monday was the last backup in the incremental scheme.
When I do it Wednesday,
I'm going to get from Tuesday to Wednesday.
And so when I do these incrementals,
I now have a bunch of smaller pieces
that to put back together when I want to restore my servers.
Now, a differential, on the other hand, is going to be
the entire difference since the last full backup.
So if on Wednesday I did a differential backup,
I'm going to have all the data that's different from Sunday,
the last full backup all the way up through Wednesday.
This is the difference between the differential
and an incremental.
So, if I do a full backup on Sunday
and then I do a differential on Monday,
then Monday's incremental and Monday's differential
are going to look exactly the same.
But on Tuesday the incremental is only going to include
the stuff since Monday.
But the differential will include everything since Sunday.
This includes all of Monday and Tuesdays changes.
And so you can see how this differential is going to grow
throughout the week until I do another full backup
on the next Sunday.
Now, if I do an incremental, it's only that last 24-hour period.
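To illustrate the difference, here's a small Python sketch of the selection logic only, not a backup tool; the file names and change times are invented to match the Sunday/Monday/Tuesday example:

```python
from datetime import datetime

# When each file was last modified (hypothetical).
files = {
    "report.docx": datetime(2023, 5, 2, 9, 0),   # changed Tuesday
    "budget.xlsx": datetime(2023, 5, 1, 15, 0),  # changed Monday
    "archive.zip": datetime(2023, 4, 20, 0, 0),  # unchanged since before Sunday
}

last_full = datetime(2023, 4, 30, 23, 0)  # Sunday's full backup
last_any  = datetime(2023, 5, 1, 23, 0)   # Monday's incremental backup

incremental  = [f for f, t in files.items() if t > last_any]   # changed since ANY backup
differential = [f for f, t in files.items() if t > last_full]  # changed since the FULL backup

print("Tuesday incremental: ", incremental)    # ['report.docx']
print("Tuesday differential:", differential)   # ['report.docx', 'budget.xlsx']
```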
Now the last type of backup we have is known as a snapshot.
Now if you're using virtualization
and you're using virtual machines,
this becomes a read only copy of your data frozen in time.
For example, I use snapshots a lot when I'm using virtual
machines or I'm doing malware analysis.
I can take a snapshot on my machine,
which is a frozen instant in time.
And then I can load the malware and all the bad things
I need to do.
And then once I'm done doing that,
I can restore back to that snapshot which was clean
before I installed all the malware.
This allows me to do dynamic analysis of it.
Now if you have a very large SAN array,
or storage area network array,
you can take snapshots of your servers
and your virtual machines in a very quick and easy way
and then you'll be able to restore them exactly back
to the way they were at any given moment in time.
Now when we use full, incremental and differential,
most of the time those are going to be used with tape backups
and offsite storage.
But if you're going to be doing snapshots,
that's usually done to disk, like a storage area network array.
Now, in addition to conducting your backups of your servers,
it's also important to conduct backups
of your network devices.
This includes their state and their configurations.
The state of a network device contains all the configuration
and dynamic information from a network device
at any given time.
If you export the state of a network device,
it can later be restored to the exact same device
or another device of the same model.
Similarly, you can backup just the configuration information
by conducting a backup of the network device configuration.
This can be done using the command line interface
on the device or using third-party tools.
For example, one organization I worked for
had thousands of network devices.
So we didn't want to go around and do a weekly configuration
backup for all those devices individually.
Instead, we configured them to do that using a tool
known as SolarWinds.
Now once a week, the SolarWinds tool would back up
all the configurations and store them
on a centralized server.
This way, if we ever had a network device that failed,
we could quickly install a spare from our inventory,
restore the configurations from SolarWinds
back to that device, and we would be back online
in just a couple of minutes.
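You don't need a commercial tool to get started with this kind of automation, either. As a rough sketch only (this is not how SolarWinds works; the device IPs, credentials, and the "show running-config" command assume hypothetical Cisco-style devices reachable over SSH), a Python script using the netmiko library could pull and save configurations on a schedule:

```python
import datetime
from netmiko import ConnectHandler  # third-party library: pip install netmiko

# Hypothetical management IPs and credentials for the devices to back up.
DEVICES = ["10.1.1.1", "10.1.1.2"]
CREDENTIALS = {"device_type": "cisco_ios", "username": "backup", "password": "changeme"}

def backup_config(host: str) -> None:
    """Pull the running configuration from one device and save it to a dated file."""
    conn = ConnectHandler(host=host, **CREDENTIALS)
    try:
        config = conn.send_command("show running-config")
    finally:
        conn.disconnect()
    with open(f"{host}-{datetime.date.today()}.cfg", "w") as f:
        f.write(config)

for device in DEVICES:
    backup_config(device)
```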
Facilities support.
In this lesson, we're going to discuss the concept
of facilities and infrastructure support
for our data centers and our recovery sites.
To provide proper facility support,
it's important to consider power, cooling,
and fire suppression.
So we're going to cover uninterrupted power supplies,
power distribution units, generators, HVAC,
and fire suppression systems in this lesson.
First, we have a UPS, or uninterruptible power supply.
Now an uninterruptible power supply,
or uninterruptible power source,
is an electrical apparatus
that provides emergency power to a load
whenever the input power source or main power fails.
Most people think of these as battery backups,
but in our data centers and telecommunication closets,
we usually see devices
that contain more than just a battery backup.
For our purposes, we're going to use a UPS
that is going to provide line conditioning
and protect us from surges and spikes in power.
Our goal in using a UPS
is to make sure that we have clean, reliable power.
Now, a UPS is great for short duration power outages,
but they usually don't last more than about 15 to 30 minutes
because they have a relatively short battery life.
The good news is the batteries
are getting better and better every day
and their lives are getting longer and longer
in newer units.
Second, we have power distribution units or PDUs.
Now a power distribution unit
is a device fitted with multiple outputs
designed to distribute electrical power,
especially to racks of computers
and networking equipment located within our data centers.
PDUs can be rack-mounted
or they can take the form of a large cabinet.
In a large data center,
you're usually going to see these large cabinets,
and in general,
there's going to be one PDU for each row of servers,
and it contains the high current circuits,
circuit breakers,
and power monitoring panels inside of it.
These PDUs can provide power protection from surges,
spikes, and brownouts,
but they are not designed
to provide full blackout protection like a UPS would,
because they don't have battery backups.
Generally, a PDU will be combined with a UPS or a generator
to provide the power that is needed during a blackout.
Third, we have generators.
Now large generators are usually going to be installed
outside of a data center
in order to provide us with longterm power
during a power outage inside your region.
These generators can be powered by diesel,
gasoline, or propane.
For example, at my office,
I have a 20,000-watt diesel generator
that's used to provide power in case we have a power outage.
Now the big challenge with a generator though,
is that they take time to get up to speed
until they're ready to start providing power
to your devices.
They usually take between 45 and 90 seconds.
So you usually need to pair them up
with a battery backup or UPS
as you're designing your power redundancy solution.
For example, at my office, if the power goes out,
the UPS will carry the load for up to 15 minutes.
During that time,
the generator will automatically be brought online,
usually taking 45 to 90 seconds.
Once that generator is fully online,
and providing the right stable power,
and it's ready to take the load,
the power gets shifted
from the UPS batteries to the generator,
using an automatic transfer switch or ATS.
Now, once the power is restored in our area
and has been steady for at least five minutes,
then our ATS will actually shift power back to the grid
through our UPS unit, that battery backup,
and then shut down our generator.
Fourth, we have HVAC units.
HVAC stands for heating, ventilation, and air conditioning.
Our data centers
are going to generate a ton of heat inside of them
because of all these servers, and switches,
and routers, and firewalls,
that are doing processing inside of them.
To cool down these devices,
we need to have a good HVAC system.
Now to help with this cooling,
most data centers are going to utilize
a hot and cold aisle concept.
Now in the simplest form,
each row of servers is going to face another row of servers.
These two rows
will have the front of the servers facing each other
and the rear of the servers facing away from the aisle.
This is because the servers are designed
to push air out the rear of the device.
So the front of the servers is in the cold aisle
and the rear of the servers is in the hot aisle.
This lets us focus our HVAC systems into the hot aisles
to suck that hot air out,
cool it down, and return it back to the cold aisle,
where it can then be circulated over the servers once again.
Remember, proper cooling is important to the health
and security of our networks and our devices.
If the network devices start to overheat,
they will shut themselves down
to protect their critical components,
and if those components get overheated for too long,
permanent damage can occur
or it can decrease the life expectancy of those devices.
Now our fifth and final thing we need to discuss
is fire suppression.
In a data center,
we usually have built-in fire suppression systems.
These can include wet pipe sprinklers,
pre-action sprinklers, and special suppression systems.
Now a wet pipe system is the most basic type
of fire suppression system,
and it involves a sprinkler system with pipes
that always contain water.
Now in a server room or data center environment,
this is kind of dangerous
because a leak in that pipe could damage your servers
that are sitting underneath them.
In general, you should avoid using a wet pipe system
in and around your data centers.
Instead, if you're going to be using a water-based system,
you should use a pre-action system
to minimize the risk of an accidental release.
With a pre-action system,
both a detector actuation,
which works like a smoke detector,
and a sprinkler head
have to be tripped
before the water is going to be released.
Again, using water in a data center,
even in a pre-action system,
is not really a good idea though, so I try to avoid it.
Instead, I like to rely on special suppression systems
for most of my data centers.
This will use something like a clean agent system.
Now, a clean agent is something like a halocarbon agent
or an inert gas,
which, when released, will displace the oxygen
in the room
and essentially suffocate the fire.
Now, the danger with using
a special suppressant system like this
is that if there's people working in your data center,
those people can suffocate
if the clean agent is being released.
So your data center needs to be equipped with an alarm
that announces when the clean agent is being released,
and you also need to make sure
there's supplemental oxygen masks available
and easily accessible
by any person who's working in that data center
whenever they hear the alarm go off
for that clean agent release.
So remember, when you're designing your data centers
and your primary work environment or your recovery sites,
you need to consider your power,
your cooling, and your fire suppression needs.
Why do we need quality of service or QoS?
Well, nowadays we operate converged networks,
which means all of our networks are carrying voice, data
and video content over the same wire.
We don't have them all separate out like we used to.
We used to have networks for phones and ones for data
and ones for video,
but now everything's riding over the same IP networks.
So, because of this convergence of mediums,
these networks now need
to have a high level of availability
to ensure proper delivery
of all of these different mediums,
because we want the phone to work
every time we pick it up, right?
Well, by using QoS, we can optimize our network
to efficiently utilize all the bandwidth at the right time
to deliver the right service to our users
and give us both success and cost savings.
Now, we want to have an excellent quality of service,
an excellent service for our customers,
and that's what we're going to start doing by using QoS.
So what exactly is QoS?
Well, quality of service enables us
to strategically optimize our network performance
based on different types of traffic.
Previously, we talked about the fact
that we want to categorize our different traffic types.
I might have web traffic and voice traffic and video traffic
and email traffic.
And by categorizing it
and identifying these different types of traffic,
I can then prioritize that traffic and route it differently.
So I might determine how much bandwidth is required
for each of those types of traffic.
And I can efficiently use my wide area network links
and all that bandwidth available, for maximum utilization,
and save myself bandwidth costs over time.
This can help me identify
the types of traffic that I should drop
whenever there's going to be some kind of congestion,
because if you look at the average load,
there's always going to be some peaks and some valleys.
And so we want to be able to figure that out.
So for example, here on the screen,
you can see the peaks and the valleys
in the load over time,
and we need to be able to categorize things
to fit within our bandwidth limitations.
So for example, if we have things like VoIP,
or voice over IP, or video service,
they need to have a higher priority,
because if I'm talking to you on a phone,
I don't want a high amount of latency.
For checking my bank balance, for instance,
I can wait another half a second for the web page to load.
But when listening to you talk, that half-second delay
starts sounding like an echo,
and it gives me a horrible service level.
So we want to be able to solve that,
and to do that, we use quality of service.
Now there are different categories of quality of service.
There are three big ones known as delay, jitter and drops.
When I talk about delay,
this is the time it takes a packet
to travel from the source to the destination.
It's measured in milliseconds,
and it's not a big deal if you're dealing with data traffic,
but if you're dealing with voice or video,
delay is an especially big thing,
especially if you're doing things live,
like talking on the phone or doing a live stream,
or something like that.
Now, jitter is an uneven arrival of packets,
and this is especially bad in Voice over IP traffic,
because you're using something like UDP.
And so if I say something to you, like, "my name is Jason,"
and you got "Jason my name is,"
it sounds kind of weird, right?
Now, usually it's not big chunks like that,
but instead it's little bits,
and you'll hear these little clicks and garbled sounds
that make it jumble up because of that jitter.
And this really sounds bad, and it's a bad user experience
if you're using Voice over IP.
And so jitter is a really bad thing
when you're dealing with voice and video.
Now, the third thing we have is what's known as a drop.
Drops are going to occur during network congestion.
When the network becomes too congested,
the router simply can't keep up with demand,
and the queue starts overflowing,
and it'll start dropping packets.
This is the way it deals with packet loss,
and if you're using TCP, it'll just send it again.
But again, if I'm dealing with VoIP, VoIP is usually UDP.
And so if we're talking
and all of a sudden my voice cuts out like that,
that would be bad too.
That's why we don't want to have packet drop on a VoIP call.
And so we want to make sure that that doesn't happen.
These network drops are something that can be avoided
by doing the proper quality of service as well.
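If you want to see how these three measurements relate, here's a small Python sketch; the packet timestamps are made up, and this is just to show how delay, jitter, and drops would be computed from send and receive times:

```python
# (sequence number, sent_ms, received_ms); None means the packet was dropped.
packets = [(1, 0, 40), (2, 20, 65), (3, 40, None), (4, 60, 98)]

delays = [rx - tx for _, tx, rx in packets if rx is not None]
avg_delay = sum(delays) / len(delays)

# Jitter: how much the delay varies from one delivered packet to the next.
jitter = sum(abs(delays[i + 1] - delays[i]) for i in range(len(delays) - 1)) / (len(delays) - 1)

drops = sum(1 for _, _, rx in packets if rx is None)

print(f"average delay {avg_delay:.1f} ms, jitter {jitter:.1f} ms, {drops} dropped packet(s)")
```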
So when we deal with this,
we have to think about effective bandwidth.
What is our effective bandwidth?
This is an important concept.
So let's look at this client and this server.
There's probably a lot more to this network
than what I'm showing you here on the screen,
but I've simplified it down for this example.
Here, you can see I have my client on the left,
and he wants to talk to the server.
So he goes up through the switch,
which uses 100 megabit per second Cat-5 cable.
Then he goes through a WAN link
over a 256 kilobit per second connection
because he's using an old DSL line.
Then that connects from that ISP over a T1 connection
to another router.
That router connects over an E1 connection to another router.
And from that router, it goes down a WAN link
over a 512 kilobit per second connection,
and then down to a switch with a gigabit connection,
down to the server.
Now, what is my effective bandwidth?
Well, it's 256 kilobits per second,
because no matter how fast any of the other links are,
whatever the lowest link is inside of this connection,
that is going to be your effective bandwidth.
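In code, that's simply the minimum across all the hops along the path; here's a short Python sketch using the link speeds from the example above (the hop names are just labels, and all values are in kilobits per second):

```python
# Link speeds along the path, in kilobits per second (from the example above).
path_kbps = {
    "client-to-switch": 100_000,    # 100 Mbps Cat 5
    "switch-to-WAN":    256,        # old DSL line
    "T1":               1_544,
    "E1":               2_048,
    "WAN-to-switch":    512,
    "switch-to-server": 1_000_000,  # gigabit
}

effective_bandwidth = min(path_kbps.values())
print(effective_bandwidth)  # 256 -> the slowest link caps the whole path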
So, we've talked about the quality of service categories.
In our next lesson, we're going to be talking about
how we can alleviate this effective bandwidth problem
and try to get more out of it,
because we need to be able
to increase our available bandwidth. But in this example,
we're limited to 256 kilobits per second,
which is going to be really, really slow for us.
Now, I like to think about effective bandwidth
like water flowing through pipes.
I can have big pipes and I can have little pipes.
And if I have little pipes,
I'm going to get less water per second through it
than if I have a really big pipe.
And so this is the idea, if you think about a big funnel,
it can start to back up on us, right?
That's the concept,
and we have to figure out how we can fix that
by using quality of service effectively,
which we're going to discuss more in the next video.
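If you want to see that bottleneck idea as a quick calculation, here is a minimal sketch in Python using the link speeds from the example above; the labels are just descriptions, not real device names.

    # The effective bandwidth of a path is set by its slowest link.
    # Speeds are in kilobits per second, taken from the example above.
    path = {
        "client to switch (Cat-5)":   100_000,    # 100 Mbps
        "DSL WAN link":               256,        # 256 Kbps
        "T1 to the ISP":              1_544,      # 1.544 Mbps
        "E1 between routers":         2_048,      # 2.048 Mbps
        "second WAN link":            512,        # 512 Kbps
        "switch to server (gigabit)": 1_000_000,  # 1 Gbps
    }

    bottleneck = min(path, key=path.get)
    print(f"Effective bandwidth: {path[bottleneck]} Kbps, limited by the {bottleneck}")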
When we deal with the quality of service categorization,
we first have to ask,
what is the purpose of quality of service?
Now, the purpose of quality of service is all about
categorizing your traffic and putting it into buckets
so we can apply a policy to certain buckets
based on those traffic categories
and then we can prioritize them based on that.
I like to tell stories and use analogies in my classes
to help drive home points.
And so, since we're talking about
quality of service and traffic,
I think it's important to talk about real-world traffic.
I live in the Baltimore, Washington D.C area.
This area is known for having
some really really bad traffic.
Now, to alleviate this they applied the idea
of quality of service to their traffic system.
They have three different categories of cars.
The first category is the general public.
Anybody who gets on the road and starts driving
is part of this group.
Then there's another category
called high occupancy vehicles or HOV.
And so, if I'm driving my car
and I have at least two other passengers with me,
I can get into special HOV only lanes
and I can go a little bit faster.
Now the third bucket is toll roads or pay roads.
And you have to pay to get on these roads.
And based on the time of day
and the amount of traffic there is,
they actually increase or decrease the price.
Now, if it's during rush hour, you might pay $5 or $10
to get in one of those special toll lanes.
But, they're going to travel a whole lot faster
than the regular general commuter lanes or those HOV lanes.
Now, what does this mean in terms of quality of service?
Well, it's really the same thing.
We take our traffic and we go, okay, this is web traffic,
and this is email traffic,
and this is voice or video traffic.
And based on those buckets we assign a priority to them.
And we let certain traffic go first
and we let it get there faster.
Now, when we categorize this traffic
we start to determine our network performance based on it.
We can start figuring out the requirements
based on the different traffic types
and whether it's voice or video or data.
If we're dealing with voice or video,
things like streaming media,
especially real-time traffic like a Skype call
or a Voice over IP service,
I want to have a very low delay
and therefore a higher priority.
This way, I can support
that streaming media and those voice services
and prevent the jitter and drops
that we talked about before.
Now, this is something that I want to make sure
has a good high priority so I can get it through.
Instead, if I have something with a low priority,
that might be something like web browsing
or non-mission-critical data.
For instance, if my employees are surfing on Facebook,
that would be a very low priority.
Or if I deal with email,
email is generally a low priority
when it comes to quality of service.
Now why is that? Isn't email important to you?
Well, it's because most email is done
as a store-and-forward communication method.
This means when I send email,
it can sit on my server for 5 or 10 minutes
before it's actually sent out to the end-user
and they'll never realize it.
So that's okay.
It can be a low priority, it'll still get there eventually.
But if I did the same thing with VoIP traffic,
even delaying it by half a second or a second,
you're going to hear jitters and bumps and echoes
and that would be a horrible service.
So, we want to make sure you get high quality of service
for VoIP and lower priority for email.
Now, that's just the way we have it set up.
You can have it set up however you want.
As long as you understand
what your quality of service policy is,
and your users understand it too,
this is going to be okay.
The best way to do that is to document it
and share that with your users.
You want to make sure your users understand your policy
because this will help make sure
that they don't have problems
and start reporting that back to your service desk.
You can do this by posting it to your internal website.
You might post it as part of your indoctrination paperwork
or whatever method you want.
You want to make sure those users understand it
because they're the ones who are going to be there
surfing Facebook or watching YouTube.
If you've categorized that as a low priority,
they're going to think something's broken.
But if they know it's a low priority,
they understand it's not broken;
it's just your corporate policy.
Now, if they're going to be surfing
something on the web that's mission critical,
that's a higher priority and it's going to get
preferential treatment with your quality of service,
they should know that too.
This is the idea here.
We have to make sure that they understand
how we categorize our traffic
and what categories those get put into.
Now, what are some ways that we can categorize our traffic?
Well, there are really three different mechanisms you can use.
We have best effort, integrated services,
and differentiated services.
Now, when we use best effort,
this is when we don't have any quality of service at all,
and so traffic is just first in, first out,
every man for himself.
We're going to do our best and just try to get it there.
There's really no reordering of packets.
There's no shaping.
It's just pretty much no quality of service.
First in, first out, best effort.
The second type is known as integrated services or IntServ.
This is also known as hard QoS.
There are different names for it
depending on what company you're using
and what routers and switches you're using.
But the idea here is,
we're going to make strict bandwidth reservations.
We might say that all web traffic
is going to get 50% of our bandwidth,
VoIP service is going to get 25%,
and video service is going to get the remaining 25%.
Now, by reserving bandwidth
for each of these services,
we decide up front how much is going to be there
for each of those three categories.
Now, when we do a DiffServ or differentiated services,
also known as soft QoS,
those percentages become more of a suggestion.
There's going to be this differentiation
between different data types
but for each of these packets,
it's going to be marked its own way.
The routers and switches can then make decisions
based on those markings
and they can fluctuate traffic a little bit as they need to.
Now, this is referred to as soft QoS
because even though we set web up as maybe 50%,
if there's not as much web browsing going on right now,
we can actually take away some of that 50%
and give it over to VoIP, increasing that from 25% to 35%.
This way, when somebody wants to browse the web,
we can then take back that extra from VoIP
and give it back to web, back to that 50% it originally had,
based on those markings and based on those categories.
Now, if we were using hard QoS or that integrated services,
even if we allocate 50% for web browsing
and nobody's using web browsing,
we're still going to have 50% sitting there
waiting to serve people for web browsing.
And that's why a lot of companies prefer to use soft QoS.
Now, let's take a look at it like this
because I like to use simple charts and graphs
to try to make it easy to understand.
With best effort at the top,
you have no strict policies at all.
And basically, you just make your best effort
at providing everyone a good quality of service.
Now with DiffServ you have less strict policies,
also known as soft QoS.
Now it's better than the best effort approach
but it's still not the most efficient
or effective method of providing a good quality of service
to those who really need it.
Now with IntServ approaches
you're going to have more of a hard QoS limit.
This is what we've talked about before.
Now, this is going to give you the highest level of service
through those strict policies.
And if you need a really strong quality of service level,
then IntServ, or hard QoS with its strict policies,
can really ensure that you get it.
Now, the way I like to look at this
is as bundles of QoS options that we can choose from.
So which of these bundles is really the best?
Well, it depends.
It depends on your network and it depends on your needs.
But most of the time, it's not going to be a best effort
because that's usually going to give you
not as much quality as you're really going to want here.
Now, when we start categorizing our traffic out there
we're going to start using these different mechanisms,
either soft or hard QoS, for doing that.
And we can do that using classification and marking.
We can do it through congestion management
and congestion avoidance.
We can use policing and shaping.
And we can also use link efficiency.
All of these choices fall under a soft QoS or hard QoS
depending on your configuration that you've set up
within your network appliances, firewalls, or routers.
As I mentioned before,
we have different ways of categorizing our traffic.
We can do it through classification, marking,
utilizing congestion management, congestion avoidance,
policing and shaping, and link efficiency.
All of these are ways for us to help implement
our quality of service and take us from this to this.
Now, as you can see,
we want to start shaping out those peaks and valleys
using these different mechanisms
to give us a better quality of service.
Now, when we look at the classification of traffic,
traffic is going to be placed
into these different categories.
Now, this is going to be done
based on the type of traffic that it is.
There's email, but even inside of email,
we have many different classes
of information inside of an email.
If you think about email,
we have POP3 traffic, we have IMAP traffic.
We have SMTP traffic. We have Exchange traffic.
Those are four different types right there.
And so we can look at the headers
and we can look at the packet type of information
and we can even use the ports that are being used.
And then we can determine what services
need higher or lower priority.
We can then do this, not just across email,
but across all of our traffic.
And by doing this, this classification
doesn't alter any bits in the frame itself or the packet.
Instead, there is no marking inside of there.
It's all based on the analysis of the packet itself,
the ports and the protocols used,
and our switches and routers are going to implement QoS
based on that information.
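To make that concrete, here is a minimal sketch in Python of classifying traffic by its destination port; the priority labels are just an example policy I made up, not a standard.

    # Classify traffic into buckets based on well-known destination ports.
    PORT_CLASSES = {
        5060: ("voice signaling (SIP)", "high"),
        25:   ("email (SMTP)",          "low"),
        110:  ("email (POP3)",          "low"),
        143:  ("email (IMAP)",          "low"),
        80:   ("web (HTTP)",            "medium"),
        443:  ("web (HTTPS)",           "medium"),
    }

    def classify(dst_port):
        # Anything we don't recognize falls into a default, best-effort bucket.
        return PORT_CLASSES.get(dst_port, ("unclassified", "best effort"))

    for port in (443, 25, 5060, 9999):
        name, priority = classify(port)
        print(f"port {port}: {name} -> {priority} priority")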
Now, another way to do this, is by marking that traffic.
With this, we're going to alter the bits within the frame.
Now we can do this inside frames, cells, or packets,
depending on what networks we're using.
And this will indicate how we handle this piece of traffic.
Our network tools are going to make decisions
based on those markings.
If you look at the Type of Service field in the IP header,
it's going to have a byte of information, or eight bits.
The first three of those eight bits are the IP Precedence.
The first six bits together make up
the Differentiated Services Code Point, or DSCP.
Now, you don't need to memorize
how this Type of Service byte is laid out inside the header.
But I do want you to remember that one of the ways
we can do this quality of service
is by marking and altering that traffic.
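As a quick illustration of what those markings look like, here is a minimal sketch in Python that pulls the IP Precedence and DSCP values out of a Type of Service byte; the example value 0xB8 is the common Expedited Forwarding marking used for voice traffic.

    # Reading the markings out of the IP Type of Service byte.
    tos_byte = 0xB8                  # Expedited Forwarding, commonly used for voice

    ip_precedence = tos_byte >> 5    # the first three bits
    dscp          = tos_byte >> 2    # the first six bits

    print(f"IP Precedence: {ip_precedence}, DSCP: {dscp}")
    # Prints: IP Precedence: 5, DSCP: 46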
Next, we have congestion management.
And when a device receives traffic
faster than it can be transmitted,
it's going to end up buffering that extra traffic
until bandwidth becomes available.
This is known as queuing.
The queuing algorithm is going to empty the packets
in a specified sequence and amount.
These algorithms are going to use one of three mechanisms.
There is weighted fair queuing,
there's low-latency queuing,
or there is weighted round-robin.
Now let's look at this example I have here.
I have four categories of traffic:
Traffic 1, 2, 3, and 4.
It really doesn't matter what kind of traffic it is,
for our example right now,
we just need to know that there's four categories.
Now, if we're going to be using a weighted fair queuing,
how are we going to start emptying these piles of traffic?
Well, I'm going to take one from 1, one from 2,
one from 3, and one from 4.
Then I'm going to go back to 1 and 2 and 3 and 4.
And we'll just keep taking turns.
Now, is that a good mechanism?
Well, maybe. It depends on what your traffic is.
If column 1, for example, was representing VoIP traffic,
this actually isn't a very good mechanism,
because that traffic has to keep waiting for its turn.
So, let's look at this low-latency queuing instead.
Based on our categories of 1, 2, 3, and 4,
we're going to assign priorities to them.
If 1 was a higher priority than 2,
then all of 1 would get emptied,
then all of 2 would get emptied,
and then all 3 and then all of 4.
Now this works well to prioritize things like
voice and video.
But if you're sitting in category 3 or 4,
you might really start receiving
a lot of timeouts and dropped packets
because it's never going to be your turn.
And you're just going to wait and wait and wait.
Now the next one we have is called the weighted round-robin.
And this is actually one of my favorites.
This is kind of a hybrid between the other two.
Now with a weighted round-robin,
we might say that category 1 is VoIP,
and category 2 is video, category 3 is web,
and category 4 is email.
And so we might say that in the priority order,
1 is going to be highest
and we're going to use a weighted round-robin,
and we might say, we're going to take three
out of category 1, two out of category 2,
and then one out of 3 and one out of 4.
And we'll keep going around that way.
We'll take three, two, one, one, three, two, one, one.
And we keep going.
That way, VoIP traffic is getting a lot of priority.
Video is getting the second highest priority.
And then we start looking at web and email
at the bottom of the barrel,
but they're still getting a turn
every couple of rounds here.
And so that way it becomes a weighted round-robin.
As I said, this is the quality of service mechanism
that I really like to implement inside my own networks.
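If you want to see how that weighted round-robin emptying works, here is a minimal sketch in Python; the queue contents are made-up values, and the weights of three, two, one, and one match the example above.

    # A weighted round-robin scheduler over four traffic categories.
    from collections import deque

    queues = {
        "voip":  deque(f"v{i}" for i in range(6)),
        "video": deque(f"d{i}" for i in range(6)),
        "web":   deque(f"w{i}" for i in range(6)),
        "email": deque(f"e{i}" for i in range(6)),
    }
    weights = {"voip": 3, "video": 2, "web": 1, "email": 1}

    def weighted_round_robin(queues, weights):
        # Keep cycling through the queues, taking 'weight' packets from each,
        # until every queue has been emptied.
        while any(queues.values()):
            for name, q in queues.items():
                for _ in range(weights[name]):
                    if q:
                        yield q.popleft()

    print(list(weighted_round_robin(queues, weights)))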
Next, we have the idea of congestion avoidance.
As new packets keep arriving, they can be discarded
if the output queue is already filled up.
Now, I like to think about this as a bucket.
As you can see here, I have a cylinder on the bottom
and it has a minimum and a maximum.
Now, if it's already at maximum and you try
to put more into the bucket,
it just overflows over the top.
Now to help prevent this, we have what's called
the RED or random early detection.
This is used to prevent this overflow from happening for us.
As the queue starts approaching that maximum,
there's an increasing possibility
that discards are going to happen.
And so what we start doing is dropping traffic early.
Instead of just dropping traffic randomly,
we're going to drop it based on priority,
with the lowest traffic priority getting dropped first.
RED is going to drop packets from the selected queues
based on their defined limits.
Now I might start dropping TCP traffic first
because I know it'll retransmit itself.
Whereas with UDP, if you drop it, it's gone forever.
And so I might keep that in my queue a little bit longer,
so it doesn't get dropped.
Now, that's the idea here with TCP traffic,
even if I drop it, we're going to get that retransmission
and we'll try again.
But with UDP, if it gets dropped,
you're never going to know about it,
and you're going to have loss of service.
Now, when you're dealing with congestion avoidance,
we're going to try to use the buffer
to our advantage, and be able to use it to help us
get more bandwidth through.
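Here is a minimal sketch in Python of that random early detection idea; the thresholds and queue size are made-up values, and real devices use per-queue limits that are far more involved than this.

    # Random early detection: start dropping before the queue is completely full.
    import random

    MIN_THRESHOLD = 20     # below this queue depth, never drop
    MAX_THRESHOLD = 40     # at or above this depth, always drop
    queue = []

    def enqueue(packet):
        depth = len(queue)
        if depth < MIN_THRESHOLD:
            queue.append(packet)        # plenty of room, just queue it
            return "queued"
        if depth >= MAX_THRESHOLD:
            return "dropped"            # queue is full, tail drop
        # Between the thresholds, drop with a probability that grows
        # as the queue approaches its maximum.
        drop_probability = (depth - MIN_THRESHOLD) / (MAX_THRESHOLD - MIN_THRESHOLD)
        if random.random() < drop_probability:
            return "dropped early"
        queue.append(packet)
        return "queued"

    results = [enqueue(f"pkt{i}") for i in range(60)]
    print(f"{results.count('queued')} of 60 packets made it into the queue")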
Now, when we start putting all these things together,
we start getting into these two concepts,
known as policing and shaping.
Policing is going to discard packets
that exceed the configured rate limit,
which we like to refer to as our speed limit.
Just like if you're driving down the highway too fast,
you're going to get pulled over by a cop
and you're going to get a ticket.
That's what policing is going to do for us.
Now, we're just going to drop your packets
anytime you're going too fast.
So, dropped packets are going to result in retransmissions,
which then creates even more traffic on the network.
Therefore, policing is only good
for very high-speed interfaces.
If you're using a dial up modem or an ISDN connection,
or even a T1, you probably don't want to use policing.
You're much better off using our second method,
which is known as shaping.
Now, what shaping is going to do for us
is it's going to allow the buffer
to delay traffic from exceeding the configured rate.
Instead of dropping those packets like we did in policing,
we're just going to hold them in our buffer.
Then, when the link frees up and there's space available,
we're going to start pushing that buffered traffic
out over that empty space, smoothing out the flow of packets.
This is why we call it shaping or packet shaping.
Now you can see what this looks like here on the screen.
I have traffic at the top,
and you'll see all those jagged lines going down.
Now, what really happens here in your network
is there's this high period of time,
and there's low periods of time,
because not everything is happening
all the time in an equal amount.
If we do policing, all we did was chop off the tops,
which gave us more retransmissions.
With shaping, instead, we're going to start filling
in from the bottom, from our queue.
So it keeps up there right towards the speed limit
without going over it.
Again, shaping does a better job
of maximizing your bandwidth,
especially on slow speed interfaces,
like a T1 connection, a dial up,
satellite connections, or ISDN.
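To see the difference between the two approaches, here is a minimal sketch in Python; the rate limit and arrival times are made-up values, purely for illustration.

    # Policing drops traffic over the rate limit; shaping buffers it for later.
    from collections import Counter

    RATE_LIMIT = 2                                   # packets allowed per second
    arrivals = [0.1, 0.2, 0.3, 0.4, 1.1, 1.2, 2.5]   # packet arrival times in seconds

    # Policing: anything over the limit within a one-second interval is dropped.
    per_second = Counter(int(t) for t in arrivals)
    policed_drops = sum(max(0, count - RATE_LIMIT) for count in per_second.values())

    # Shaping: excess packets sit in a buffer and get released in later intervals.
    released, backlog = [], 0
    for second in range(int(max(arrivals)) + 2):
        backlog += sum(1 for t in arrivals if int(t) == second)
        sent = min(backlog, RATE_LIMIT)
        released.append(sent)
        backlog -= sent

    print(f"policing drops {policed_drops} packets outright")
    print(f"shaping sends {released} packets per second and drops none")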
Then the last thing we need to talk about here
is link efficiency.
Now there's a couple of things we need to mention
in regard to link efficiency.
The first of which is compression.
To get the most out of your link,
you want to make it as efficient as possible.
And so to do that, we can compress our packets.
If we take our payloads and we compress them down,
that's going to conserve bandwidth
because there are fewer ones and zeros
that need to go across the wire.
VoIP is a great thing that you can compress
because there's so much extra space
that's wasted inside of voice traffic.
VoIP payloads can actually be reduced
by up to 50% of their original space.
We could take it down from 40 bytes
down to 20 bytes by using compression.
If you think that's good, look at the VoIP header.
I can compress the VoIP header down
by 90 to 95% of its original size.
I can take it from 40 bytes down to just two to four bytes.
To do this, we use something called compressed RTP or cRTP.
Now, when I have the original VoIP packet,
as you can see here, I have an IP header,
I have a UDP header,
and I have an RTP header.
And then I have my voice payload.
I can compress all of that down into just a cRTP,
which consolidates the IP, the UDP,
and the RTP altogether into one.
The voice payload can also be squeezed down
to about half of its size.
Now, you're not going to notice a big difference
in your audio quality by doing this either,
and this can be utilized on slower speed links
to make the most of your limited bandwidth.
And it's not just for VoIP.
You can do this with other types of data too.
Compression is a great thing to use.
There are devices out there called WAN accelerators
that focus specifically on compressing your data
before sending it out over your WAN link.
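If you want to put rough numbers on those savings, here is a minimal sketch in Python using the byte counts mentioned above; the 50 packets-per-second rate is an assumption based on typical 20-millisecond voice framing, not something from this lesson.

    # Rough per-call bandwidth before and after cRTP plus payload compression.
    PACKETS_PER_SECOND = 50            # assumed 20 ms voice framing

    header_before, payload_before = 40, 40   # bytes, uncompressed
    header_after,  payload_after  = 4, 20    # bytes, compressed header and payload

    before_kbps = (header_before + payload_before) * 8 * PACKETS_PER_SECOND / 1000
    after_kbps  = (header_after  + payload_after)  * 8 * PACKETS_PER_SECOND / 1000

    print(f"one call: {before_kbps:.0f} Kbps before, {after_kbps:.1f} Kbps after")
    # Roughly 32 Kbps down to about 10 Kbps per call in this example.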
The last thing I want to talk about here
is what we call LFI, which is another method
to make more efficient use of your links.
This is known as link fragmentation and interleaving.
Now what this does is if you have a really big packet,
it'll start chopping those up
and take those big packets and fragment them,
and then interleave smaller packets in between them.
This way, it's going to allow you to utilize
those slower speed links to make the most
of your limited bandwidth.
Notice here I have three voice packets,
and one big chunk of data.
Now, what the router would do is chop that data up
and send one small voice piece,
and then one small data piece,
and then one small voice piece,
and one small data piece.
That way, the voice doesn't suffer
from huge latency by waiting for that big piece
of data to go through first.
By doing this fragmentation and interleaving,
it allows you to get some of that high priority traffic out
in between those larger data structures as well.
High availability.
In this lesson,
we're going to talk all about high availability.
Now, when we're talking about high availability,
we're really talking about making sure our systems are up
and available.
Availability is going to be measured in what we call uptime
or how many minutes or hours you're up and available
as shown as a percentage.
Usually, you're going to take the amount of minutes
you were up,
divided by the total amount of minutes in a period,
and that gives you a percentage known as uptime.
Now, we try to maintain what is known as the five nines
of availability in most commercial networks.
This is actually really hard because that's 99.999%.
That means I get a maximum of about five minutes
of downtime per year,
which is not a whole lot of downtime.
In some cloud based networks,
they aim for six nines of availability, or 99.9999%.
This equates to just 31 seconds of downtime
each and every year.
Now, as you can imagine,
I need more than 31 seconds of downtime,
or even five minutes of downtime,
to be able to patch my servers and install a new hard drive
or put in a new router or switch when one fails.
So, how do I maintain that high level of availability?
Well, I'm going to do that,
by designing my networks to be highly available.
Now, there are two terms you need to understand
and be familiar with,
when we talk about high availability.
There is availability and reliability,
and these are different things.
When I'm talking about availability,
this is concerned with being up and operational.
When I talk about reliability,
I'm concerned with not dropping packets
inside of my network.
If your network is highly available,
but it's not reliable,
it's not a very good network
because it's dropping things all the time
and isn't doing what it's supposed to.
But conversely,
you can have a really highly reliable network,
but if it's not a highly available one,
nobody can use it either because it's down all the time.
So that wouldn't be good either.
So, let's say I had the most highly reliable network
in the entire world,
but it's only up 20 minutes a year.
That's not going to be any good, right?
So, we want to make sure we balance these two things.
We have to aim for good enough in both areas
to meet our business needs based on the available resources
and the amount of money we have to build our networks.
So, when we measure our different network components,
we have to determine how highly available they are.
And we do that through measurement of MTTR and MTBF.
Now, MTTR is the mean time to repair.
This measures the average time it takes to repair
a network device when it breaks.
After all,
everything is going to break eventually.
So, when a device breaks,
how long does it take for you to fix it?
And how much downtime are you going to experience?
That is what we're trying to measure
when we deal with the mean time to repair.
Now, the mean time between failures or MTBF,
is going to measure the average time
between when a failure happens on a device
and the next failure happens.
Now, these two terms can often be confusing.
So, let me display it on a timeline and explain a little bit
about what they look like in the real world.
Now, let's say I had a system failure
at this first stop sign here on the left side.
Then we resume normal operations because we fix things.
That amount of time, was the time to repair.
Now, if I average all the times to repair
over the entire year for that type of device,
that's going to give me my MTTR,
my mean time to repair, the average time to repair.
Now, on the failure side of things,
we want to measure the time from one failure,
through using it and fixing it,
until the next failure happens.
This becomes the time between the failures.
If I average all those together,
I get the average time between failures
or the mean time between failures, MTBF.
Hopefully, you can see the difference here.
Remember, when we're dealing with mean time to repair,
we want this to be a very small number.
When we deal with the mean time between failures,
we want this to be a very large number.
This means that, with a very small number for mean time to repair,
we can fix things really quickly
and get ourselves back online.
So, the lower the mean time to repair is,
the better the network availability.
Now, on the other hand,
when we start talking about mean time between failures,
we want a really long time,
because this means that the device has stayed up
and operational for a very long time before it fails.
This is going to give us better network availability,
and overall, it should give us better reliability too.
Now, we don't want a lot of failures here.
And so the more time in between failures,
the better that is for our network.
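To see how those two measurements and availability fit together, here is a minimal sketch in Python; the outage data is hypothetical, purely for illustration.

    # Calculating MTTR, MTBF, and availability for one device over a year.
    HOURS_IN_YEAR = 365 * 24

    # Each outage: (hours of uptime since the previous failure, hours spent repairing)
    outages = [(2000, 4), (3100, 2), (2650, 6)]

    mttr = sum(repair for _, repair in outages) / len(outages)        # mean time to repair
    mtbf = sum(up + repair for up, repair in outages) / len(outages)  # mean time between failures

    total_downtime = sum(repair for _, repair in outages)
    availability = (HOURS_IN_YEAR - total_downtime) / HOURS_IN_YEAR   # uptime over total time

    print(f"MTTR: {mttr:.1f} hours, MTBF: {mtbf:.1f} hours")
    print(f"Availability: {availability:.4%}")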
So, how do we design these networks
to be highly reliable and highly available?
Well, we're going to add redundancy to our networks
and their devices.
Now, redundancy can be achieved through a single device
or by using multiple devices.
If you're using a single device,
you're still going to have single points of failure
in your network,
but it is cheaper than being fully hardware redundant.
Let's take a look at this concept for a moment.
Here you can see a single point of failure in my network.
Even though I have two switches and multiple connections
between those switches,
which gives me additional redundancy,
that router is not giving me additional redundancy.
It's a single point of failure
because it's the only router I have.
So, even if the router has internal hardware redundancy,
like two power supplies and two network cards,
I still only have one router chassis and one circuit board
running in that router.
So, if that router goes down,
this entire network is going to stop.
Therefore, this is considered a single point of failure.
Now instead,
I could redesign the network
and I can increase its redundancy by doing this.
Notice, I now have two PCs that want to talk to each other.
And each of them has dual network interface cards
talking to two different switches.
And each of those switches talks to two different routers.
Everything is connected to everything else
in a mesh topology for these network devices.
This gives me multiple connections between each device
and provides me with link redundancy, component redundancy,
and even inside those devices,
I may have two network cards, two power supplies,
and two of every other internal network component there is,
so that I have a very redundant
and highly available network.
Now, if one of those routers needs to be upgraded,
I can take it offline and update its firmware,
and then the entire time that second router
is still on the network,
maintaining the load and providing service to all the users.
Then I can put the first router back on the network,
take off the second router and then do its upgrades.
By doing this and taking turns,
I still am able to have network functions run,
and I have no downtime associated with this network.
This is how you keep a network highly available.
Now, let's talk a little bit more about hardware redundancy.
Inside these routers and other network devices,
we can have hardware redundancy or the devices themselves
could be hardware redundant.
Now, if I have two routers and they're both
serving the same function,
this is considered hardware redundancy.
I could also have hardware redundancy in the components
by having two network cards or two hard drives
or two internal power supplies on a single device.
That way, if one of them fails, the second one takes over.
Usually, you're going to find this
in strategic network devices,
things like your switches, your routers, your firewalls,
and your servers,
because you can't afford a failure
in any one of those devices,
because they would take down large portions
of your network or its services.
On the other hand, if I'm considering my laptop,
I only have one hard drive in it.
If that laptop fails or that hard drive fails,
I would just deal with that downtime.
I might buy a new laptop or a new hard drive
and then restore from an old backup.
That would get me back up and running.
Now, when we're working with end-user devices
like workstations and clients,
we often don't deal with redundancy.
But when you start getting to servers and routers
and switches and firewalls,
you need to start having hardware
and component level redundancy
because these serve lots of end-users.
When we deal with this redundancy,
we can then cluster our devices and have them work in either
an active-active
or an active-passive configuration.
All right.
Let's assume I have this one computer
and it has two network interface cards
that are connected to the network.
Do I want to talk to both routers at the same time?
Well, if I'm active-active,
then both of those network interface cards
are going to be active at the same time,
and they each are going to have their own MAC address,
and they're going to be talking at the same time
to either of these two routers.
This can then be done to increase the bandwidth
of this computer and load balance
across both network interface cards.
This is known as Network Interface Card teaming,
or NIC teaming,
where a group of network interface cards,
is used for load balancing and failover for a server
or another device like that.
Now, on the other hand,
we can use active-passive,
and this is going to have a primary
and a backup network interface card.
Now, one of these cards is going to be active
and in use at all times.
And when it fails,
the other card is going to come out of standby and take over.
In this case,
there is no performance increase by having two cards,
but I have true redundancy and failover capabilities.
In an active-passive configuration,
both NICs are going to be working together
and they're going to have a single MAC address
that they're going to display to the network,
so they look like they're a single device.
Now, when you start looking at redundancy at layer three,
we're going to start talking about our routers.
Now here,
our clients are getting configured with a default gateway,
which is our router by default.
But, if the default gateway went down,
we wouldn't be able to leave the subnet,
and so we'd be stuck on the internal network.
Now, we don't want that.
So instead,
we want to add some redundancy
and we'll use layer three redundancy
using a virtual gateway.
To create a virtual gateway,
we need to use a First Hop Redundancy Protocol, or FHRP,
such as the Hot Standby Router Protocol, HSRP,
or the Virtual Router Redundancy Protocol, VRRP.
Now, the most commonly used First Hop Redundancy Protocol
is known as HSRP or the Hot Standby Router Protocol.
This is a layer three redundancy protocol
that's used as a proprietary First Hop Redundancy Protocol
in Cisco devices.
HSRP is going to allow for an active and a standby router
to be used together.
And instead,
we get a virtual router that's defined
as our default gateway.
The client devices like the workstations and servers
are then going to be configured to use the virtual router
as its gateway.
When the PC communicates to the IP of the virtual router,
the router will determine which physical router is active
and which one is standby.
And then, it forwards the traffic to that active router.
If the active router goes down,
the standby router will pick up the responsibility
for that active router
until the other router comes back online
and takes over its job again.
Now, with VRRP,
the Virtual Router Redundancy Protocol,
this is one that was created
by the Internet Engineering Task Force.
It's an open standard variant
of the Hot Standby Router Protocol, or HSRP.
VRRP allows for one master or active router,
and the rest can then be added in a cluster as backups.
Unlike HSRP,
where you can only have one router as active
and one as standby,
VRRP is going to allow you to have multiple standby routers.
Just like HSRP,
you're going to configure the VRRP
to create a virtual router
that's going to be used as a default gateway
for all of your client devices.
Now, in order to provide load balancing on your networks
and to increase both redundancy
and performance of your networks,
you can use GLBP,
which is the Gateway Load Balancing Protocol,
or you can use LACP, the Link Aggregation Control Protocol.
Now, GLBP or the Gateway Load Balancing Protocol
is a Cisco protocol,
and it's another proprietary First Hop Redundancy Protocol.
Now, GLBP will allow us to create a virtual router
and that virtual router will have two routers
being placed behind it,
in active and standby configuration.
The virtual router or gateway will then forward traffic
to the active or standby router
based on which one has the lower current loading,
when the gateway receives that traffic.
If both can support the loading,
then the GLBP will send it to the active
since it's considered the primary device.
By using GLBP,
you can increase the speeds of your network
by using load balancing between two routers or gateways.
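Here is a minimal conceptual sketch in Python of that forwarding decision; this is only the idea described above, not how GLBP is actually implemented on real routers, and the load percentages are made-up values.

    # Pick which gateway should receive the next flow, based on current load.
    gateways = {
        "active":  {"load_pct": 60},   # hypothetical current loading
        "standby": {"load_pct": 35},
    }

    def choose_gateway(gateways):
        active, standby = gateways["active"], gateways["standby"]
        # Send to the standby only if it is currently less loaded;
        # otherwise prefer the active router, since it is the primary.
        if standby["load_pct"] < active["load_pct"]:
            return "standby"
        return "active"

    print(f"forwarding this traffic to the {choose_gateway(gateways)} router")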
Now, the second thing we can use,
is LACP or Link Aggregation Control Protocol.
This is a redundancy protocol that's used at layer two.
So we're going to be using this with switches.
LACP is going to achieve redundancy by having multiple links
between the network devices,
where load balancing over multiple links can occur.
The links are all going to be considered
part of a single combined link,
even though we have multiple physical links.
This gives us higher speeds and increases our bandwidth.
For example,
let's pretend I have four Cat5 cables.
Each of these is connected to the same switch.
Now, each of those cables has 100 megabits per second
of bandwidth.
Now, if I use the Link Aggregation Control Protocol,
I can bind these all together and aggregate them
to give me 400 megabits per second
of combined bandwidth by creating
one single logical link
from those four connections.
Now, let's consider what happens
when that traffic is trying to leave our default gateway
and get out to the internet.
Now in your home,
you probably only have one internet connection,
but for a company,
you may wish to have redundant paths.
For example, at my office,
we have three different internet connections.
The first is a microwave link that operates
at 215 megabits per second for uploads and downloads.
The second, is a cable modem connection.
It operates at 300 megabits per second for downloads
and 30 megabits per second for uploads.
Now the third is a cellular modem,
and that gives me about 100 megabits per second
for downloads and about 30 megabits per second for uploads.
Now, the reason I have multiple connections,
is to provide us with increased speed and redundancy.
So, to achieve this,
I take all three of these connections
and connect them to a single gateway
that's going to act as a load balancer.
If all the connections are up and running,
they're going to load balance my traffic
across all three of those connections
to give me the highest speeds at any given time.
But, if one of those connections drops,
the load balancer will remove it from the pool
until it can be returned to service.
By doing this,
I can get a maximum speed of about 615 megabits per second
for a combined download.
And on the upload,
I can get about 275 megabits per second,
when using all three connections
and they're all functioning and online.
Similarly,
you might be in an area where you can get fiber connections
to your building.
Now, in those cases,
you may purchase a primary and a backup connection.
And if you do,
you should buy them from two different providers.
If both of your connections are coming from the same company
and they go down,
well, guess what?
You just lost both of your connections
because the upstream ISP might be down.
For this reason,
it's always important to have diversity in your path
when you're creating link redundancy,
just like I did in my office.
I have a microwave connection through one ISP.
I have a cable modem through another ISP,
and I have a cellular modem through a third ISP.
That way,
if any one of them goes down,
I still have two other paths I can use.
Now, the final type of redundancy that we need to discuss,
is known as multipathing.
Multipathing is used in our storage area networks.
Multipathing is used to create more than one physical path
between the server and its storage devices.
And this allows for better fault tolerance
and performance enhancements.
Basically, think of multipathing
as a form of link aggregation,
but instead of using it for switches,
we're going to use it for our storage area networks.
In the last lesson I showed you a couple of diagrams
of redundant networks,
but one of the things we have to think about in this lesson
is the set of considerations we have
when we start designing these redundant networks.
First you need to ask yourself,
are you going to use redundancy in the network,
and if so, where, and how?
So are you going to do it from a module or a parts perspective?
For instance, are you going to have multiple power supplies,
multiple network interface devices, multiple hard drives,
or are you going to look at it more from a chassis redundancy perspective
and have two sets of routers or two sets of switches?
These are things you have to think about.
Which one of these are you going to use,
because each one is going to affect the cost
of your network, based on the decisions you make.
You have to be able to make a good business case
for which one you're going to use, and why.
For instance, if you could just have
a second network interface card or a second power supply,
that's going to be a lot cheaper
than having to have an entire switch
or an entire extra router there.
Now, each of those switches and routers,
some of these can cost $3,000, $4,000, or even $5,000,
and so it might be a lot cheaper
to have a redundant power supply, right,
and so these are the things you have to think about
and weigh as you're building your networks.
Another thing you have to think about
is software redundancy,
and which features of those are going to be appropriate.
Sometimes you can solve a lot of these redundancy problems
by using software as opposed to hardware.
For example, if you have a virtual network setup,
you could just put in a virtual switch
or a virtual router in there,
and that way you don't have to bring
another real router or real switch in,
that can save you a lot of money.
There's also a lot of other software solutions out there,
like a software RAID,
that will give you additional redundancy
for your storage devices,
as opposed to putting in an extra hard drive chassis,
or another RAID array or storage area network.
Also, these are the types of things
you have to be thinking about
as you're building out your network, right?
When you think about your protocols,
what protocol characteristics
are going to affect your design requirements?
This is really important if you're designing things,
and you're using something like
TCP versus UDP in your designs,
because TCP has that additional redundancy
by resending packets, whereas UDP doesn't.
This is something you have to consider as well.
As you design all these different things,
all of these different factors are going to work together,
just like gears, where each one turns another
and each one feeds into another,
and you get more reliability and availability
in your networks
by adding all these components together.
In addition to all this,
there are other design considerations
that we have to think about as well,
like what redundancy features should we use
in terms of powering the infrastructure devices?
Are we going to have internal power supplies
and have two of those, and have them redundant?
Or are we going to have battery backups, or UPSs?
Are we going to have generators?
All of these things are things you have to think about,
and I don't have necessarily the right answers for you,
because it all comes down to a case-by-case basis.
Every network is going to be different,
and every one has its own needs
and its own business case associated with it.
The networks that I had at former employers
were serving hundreds of thousands of clients,
and those were vastly different than the ones
that are servicing my training company right now,
with just a handful of employees.
Because when you're dealing with your network design
and your redundancies,
you have to think about the business case first.
Each one is going to be different
based on your needs and your considerations.
What redundancy features should be used
to maintain the environmental conditions of your space?
To have good power and space and cooling,
you need to make sure
that you're thinking about air conditioning,
and do you have one unit or two?
Do you have generators onsite?
Do you have additional thermal heating or thermal cooling?
All of these things are things you have to think about.
What do you do when power goes down?
What are some of those things
that you're going to have to deal with
if you're running a server farm
that has to have units running all the time,
because it can't afford to go down
because it's going to affect
thousands and thousands of people,
instead of just your one office with 20 people?
All of these are things you have to consider
as you think about it.
In my office, we made the decision
that one air conditioning unit was enough,
because if it goes down, we might just not work today
and we'll come to work tomorrow, we can get over that.
But in a server farm,
we need to make sure we have multiple air conditioners,
because if that goes down
it can actually burn up all the components, right?
So we have to have additional power and space and cooling
that are fully redundant,
because of that server infrastructure
that we're supporting there.
These are the things you have to balance in your practices.
And so when you start looking at the best practices,
I want you to examine your technical goals
and your operational goals.
Now what I mean by that is,
what is the function of this network?
What are you actually trying to accomplish?
Are you trying to get to 90% uptime, or 95%, or 99%,
or are you going for that gold standard
of five nines of availability?
Every company has a different technical goal,
and that technical goal is going to determine
the design of your network.
And you need to identify that
inside of your budgeting as well,
because funding these high-availability features
is really expensive.
As I said, if I want to put a second router in there,
that might cost me another $3,000 to $5,000.
In my own personal network,
we have a file server, and it's a small NAS device.
We decided we weren't comfortable
having all of our file storage on a single hard drive,
so we built this NAS array instead.
That way, if one of those drives goes out,
we have three others that are carrying the load.
This is the idea here.
Now, eventually we decided we didn't need that NAS anymore,
and so we replaced that NAS enclosure with a full RAID 5.
Later on we took that full RAID 5
and we switched it over to a cloud server
that has redundant backups
in two different cloud environments.
And so all of these things work together
based on our decisions,
but as we moved up that scale
and got more and more redundancy,
we have more and more costs associated.
It was a lot cheaper just to have an 8-terabyte hard drive
with all of our files on it,
then we went to a NAS array
and that cost two or three times that money,
then we went to a full RAID 5
and that cost a couple more times that,
then we went to the cloud and we have to pay more for that.
Remember, all your decisions here
are going to cost you more money,
but if it's worth it to you, that would be important, right,
and so these are the things you have to balance
as you're designing these fully redundant networks,
based on those technical goals.
You also need to categorize
all of your business applications into profiles,
to help with this redundancy mission
that you're trying to go and accomplish here.
This will really help you as you start
going into the quality of service as well.
Now if I said, for instance,
that web is considered category one
and email is category two
and streaming video's going to be category three,
then we can apply profiles
and give certain levels of service
to each of those categories.
Now, we'll talk specifically about how that works
when we talk about quality of service in a future lesson.
Another thing we want to do
is establish performance standards
for our high-availability networks.
What are the standards that we're going to have to have?
These standards are going to drive
how success is measured for us,
and in the case of my file server, for instance,
we measure success as it being up and available
when my video editors need to access it,
and that they don't lose data,
because if we lost all of our files,
that'd be bad for us, right?
Those are two metrics that we have,
and we have numbers associated with each of those things.
In other organizations, we measure it based on the uptime
of the entire end-to-end service,
so for an ISP, if a client can't get out to the internet,
that would be a bad thing, and that's one of their measurements.
Now, the other one might be, what is their uptime?
All of these performance standards are developed
through metrics and key performance indicators.
If you're using something like ITIL
as your IT service management standards,
this is what you're going to be doing as you're trying
to run those inside your organization as well.
Finally, here we want to define how we manage and measure
the high-availability solutions for ourselves.
Metrics are going to be really useful to quantify success,
if you develop those metrics correctly.
Decision-makers and leaders love seeing metrics.
They love seeing charts and seeing the performance,
and how it's going up over time,
and how our availability is going up,
and how our costs are going down.
Those are all good things,
but if you don't know what you're measuring
or why you're measuring it,
and it doesn't tie back to your performance standards,
then these are the kinds of things
that are just wasting your time with metrics.
A lot of people measure a lot of things,
and those measurements don't really tell you
if you're getting the outcome you want.
I want to make sure that you think about
how you decide on what metrics you're going to use.
Now, we've covered a lot of different design criteria
in this lesson, but the real big takeaway here
that I want you to think about is this.
If you have an existing network,
you can add availability to it,
and you can add redundancy to it.
You can retrofit stuff in,
but it's going to cost you a lot more time
and a lot more money.
It is much, much cheaper
to design this stuff early in the process
when you start building a network from scratch.
So, if you're designing a network and you're asked early on
what kind of things you need,
I want you to think about all these things of redundancy
in your initial design.
Adding them in early is going to save you a lot of money.
Every project has three main factors,
time, cost, and quality,
and usually, one of these things is going to suffer
at the expense of the other two.
For example, if I asked you to build me a network
and I want it to be fully redundant
and available by tomorrow, could you do it?
Well, maybe, but it's probably going to cost me a lot of money,
and because I'm giving you very little time,
it's going to cost me even more,
or your quality is going to suffer.
So, you could do it good, you could do it quick,
or you could do it cheap, but you can't do all three.
It's always going to be a trade-off between these three things,
and I want you to remember
as you're out there and you're designing networks,
you need to make sure you're thinking about your redundancy
and your availability and your reliability,
because often that quality is going to suffer
in favor of getting things out quicker
or getting things out cheaper.
Recovery sites.
In this lesson, we're going to discuss the concept
of recovery sites.
After all, things are going to break, and your networks
are going to go down.
This is just a fact of life.
So what are you going to do when it comes time
to recover your enterprise network?
Well, that's what we're going to discuss in this lesson.
When it comes to designing redundant operations
for your company,
you really should consider a recovery site.
And with recovery sites, you have four options.
You see, you can have all the software and hardware
redundancy you want.
But at the end of the day,
sometimes you need to actually recover your site too.
Now this could be because there's a fire that breaks out
in your building or a hurricane or earthquake.
All of these things might require you to relocate
and if you do, you're going to have to choose
one of four options.
This could be a cold site, a warm site, a hot site
or a cloud site.
Now when we deal with cold sites,
this means that you have a building that's available
for you to use,
but you don't have any hardware or software in place.
And if you do, those things aren't even configured.
So you may have to go out to the store and buy routers
and switches and laptops and servers
and all that kind of stuff.
You're going to bring it to a new building, configure it
and then restore your network.
This means that while recovery is possible,
it's going to be slow and it's going to be time consuming.
If I have to build you out a new network at a cold site,
that means I'm going to need you to bring everything in
after the bad thing has already happened,
such as your building catching fire.
And this can take me weeks or even months
to get you fully back up and running.
Now, the biggest benefit of using a cold site
is that it is the cheapest option
that we're going to talk about.
The drawbacks are that it is slow and essentially
this is just going to be an empty building
that's waiting for you to move in and start rebuilding.
Now next, we have a warm site.
A warm site means you have the building available
and it already contains a lot of the equipment.
You might not have all your software installed
on these servers or maybe you don't have the latest security
patches or even the data backups from your other site
haven't been recovered here yet.
But you do already have the hardware
and the cabling in place.
With a warm site,
we already have a network that's running in the facility.
We have switches and routers and firewalls.
But we may not maintain it fully
each and every day of the year.
So, when a bad event happens
and you need to move into the warm site,
we can load up our configurations on our routers
and switches, install the operating systems on the servers,
restore the files from backup
and usually within a couple of days,
we can get you back up and running.
Normally with a warm site,
we're looking at a recovery time of between 24 hours
and seven days.
Basically, under a week.
Recovery here is going to be fairly quick,
but not everything from the original site
is going to be there and ready for all employees
at all times.
Now, if speed of recovery is really important to you,
the next type of site is your best choice.
It's known as a hot site.
Now hot site is my personal favorite.
But it's also the most expensive to operate.
With a hot site, you have a building, you have the equipment
and you have the data already on site.
That means everything in the hot site is up and running
all the time.
Ready for you to instantly switch over your operations
from your primary site to your hot site
at the flip of a switch.
This means you need to have the system and network
administrators working at that hot site every day
of the year, keeping it up and running, secured
and patched and ready for us to take over operations
whenever we're told to.
Basically, your people are going to walk out of the old site,
get in their car, drive to the new site, login
and they're back to work as if nothing ever happened.
This is great because there's very minimal downtime.
And you're going to have nearly identical levels of servers
at the main site and the hot site.
But as you can imagine, this costs a lot of money.
Because I have to pay for the building,
two sets of equipment, two sets of software licenses
and all the people to run all this stuff.
You're basically running two sites at all times.
Therefore, a hot site gets really expensive.
Now a hot site is very critical
if you're in a high availability type of situation.
Let's say you work for a credit card processing company.
And every minute they're down costs them millions of dollars.
They would want to have a hot site, right?
They don't want to be down for three or four weeks.
So they have to make sure they have their network up
and available at all times.
Same thing if you're working for the government
or the military,
they always need to make sure they're operating,
because otherwise people could die.
And so they want to make sure it is always up and running.
That's where hot sites are used.
Now, if you can get away from those types of criticality
requirements though, which most organizations can,
you're going to end up settling on something like a warm site,
because it's going to save you on the cost of running
that full recovery hot site.
Now the fourth type of site we have
is known as a cloud site.
Now a cloud site isn't exactly a full recovery site,
like a cold, warm, or hot site is.
In fact, there may be no building for you to move
your operations into.
Instead, a cloud site is a virtual recovery site
that allows you to create a recovery version
of your organization's network in the cloud.
Then if disaster strikes, you can shift all your employees
to telework operations by accessing that cloud site.
Or you can combine that cloud site with a cold or warm site.
This allows you to have a single set of system
administrators and network administrators
that run your day to day operational networks
and they can also run your backup cloud site,
because they can operate it all
from wherever they're sitting in the world.
Now cloud sites are a good option to use,
but you are going to be paying a cloud service provider
for all the compute time, the storage
and the network access required to use that cloud site
before, during and after the disastrous event.
So, which of these four options should you consider?
Well, that really depends on your organization,
its recovery time objective, or RTO,
and its recovery point objective, or RPO.
Now the recovery time objective or RTO
is the duration of time and service level
within which a business process has to be restored
after a disaster happens in order to avoid unacceptable
consequences associated with a break in continuity.
In other words, our RTO is going to answer our question,
how much time did it take for the recovery to happen
after the notification of a business process disruption?
So, if you have a very low RTO,
then you're going to have to use either a hot site
or a cloud site because you need to get up and running
quickly.
That is the idea of a low RTO.
Now on the other hand, we have to think about our RPO.
Which is our recovery point objective.
Now RPO is going to be the interval of time that might pass
during the disruption before the quantity of data loss
during that period exceeds the business continuity plan's
maximum allowable threshold or tolerance.
Now RPO is going to determine the amount of data
that will be lost or will have to be re-entered
during network operations in downtime.
It symbolizes the amount of data that can be acceptably lost
by the organization.
For example, in my company we have an RPO of 24 hours.
That means if all of our servers crashed right now,
I as the CEO have accepted the fact that I can lose no more
than the last 24 hours worth of data and that would be okay.
To achieve this RPO,
I have daily backups that are conducted every 24 hours.
So, we can ensure we always have our data backed up
and ready for restoral at any time.
And that means we will lose at most 24 hours worth of data.
The RTO that recovery time objective is going to be focused
on the real time that passes during a disruption.
Like if you took out a stopwatch and started counting.
For example, can my business survive
if we're down for 24 hours?
Sure.
It would hurt, we would lose some money, but we can do it.
How about seven days?
Yeah, again, we would lose some money,
we'd have some really angry students,
but we could still survive.
Now, what about 30 days?
No way.
Within 30 days all of my customers and students,
they would have left me.
They would take their certifications
through some other provider out there
and I would be out of business.
So I had to figure out that my RTO is someplace between one
and seven days to make me happy.
So that's the idea of operational risk tolerance;
we start thinking about this from an organizational level.
How much downtime are you willing to accept?
Based on my ability to accept seven days,
I could use a warm site instead of a hot site.
But if I could only accept 24 hours of downtime,
or even just five minutes of downtime,
then I would have to use a hot site instead.
RTO is used to designate that amount of real time
that passes on the clock before that disruption
begins to have serious and unacceptable impediments
to the flow of our normal business operations.
That is the whole concept here with RTO.
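To make the RTO and RPO trade-offs concrete, here is a rough Python sketch, my own illustration and not part of the course, that maps a target RTO to one of the recovery site types discussed above and a target RPO to a backup interval. The hour thresholds used here are illustrative assumptions, not official guidance.

```python
# A minimal sketch (illustrative only) of turning RTO/RPO targets
# into the recovery decisions described in this lesson.

def choose_recovery_site(rto_hours: float) -> str:
    """Suggest a recovery site type based on how quickly we must be back up."""
    if rto_hours <= 1:
        return "hot site (or cloud site) - near-instant switchover"
    elif rto_hours <= 24 * 7:
        return "warm site - equipment on hand, restore within about a week"
    else:
        return "cold site - building only, weeks to rebuild"

def required_backup_interval(rpo_hours: float) -> str:
    """The backup interval can never be longer than the RPO,
    or we risk losing more data than we agreed to accept."""
    return f"back up at least every {rpo_hours} hours"

# Example: the 24-hour RPO and roughly one-to-seven-day RTO described above.
print(choose_recovery_site(rto_hours=7 * 24))   # warm site
print(required_backup_interval(rpo_hours=24))   # back up at least every 24 hours
```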
Now when we start talking about RPO and RTO,
you're going to see this talked about a lot in backups
and recovery as well.
When you deal with backups and recovery,
you have a few different types of backups.
We have things like full backups, incremental backups,
differential backups and snapshots.
Now a full backup is just what it sounds like.
It's a complete backup of every single file on a machine.
It is the safest and most comprehensive backup method,
but it's also the most time consuming and costly.
It's going to take up the most disk space
and the most time to run.
This is normally going to be run on your servers.
Now another type of backup we have
is known as an incremental backup.
With an incremental backup, I'm going to back up the data
that changed since the last backup.
So, if I did a full backup on Sunday
and I go to do an incremental backup on Monday,
I'm only going to back up the things that have changed
since doing that full backup on Sunday.
Now another type we have is known as a differential backup.
A differential backup is only going to back up the data
since the last full backup.
So, let's go back to my example
of Sunday being a full backup
and then I did an incremental backup on Monday.
Then that backup is going to copy everything since Sunday.
But if I do an incremental on Tuesday, it's only going to do
the difference between Monday and Tuesday.
Because Monday was the last backup when using incremental backups.
When I do it Wednesday,
I'm going to get from Tuesday to Wednesday.
And so when I do these incrementals,
I now have a bunch of smaller pieces
that I have to put back together when I want to restore my servers.
Now a differential, on the other hand, is going to be
the entire difference since the last full backup.
So if on Wednesday I did a differential backup,
I'm going to have all the data that's different from Sunday,
the last full backup all the way up through Wednesday.
This is the difference between the differential
and an incremental.
So if I do a full backup on Sunday
and then on Monday I do both an incremental
and a differential,
they're going to look exactly the same.
But on Tuesday the incremental is only going to include
the stuff since Monday.
But the differential will include everything since Sunday.
This includes all of Monday and Tuesdays changes.
And so you can see how this differential is going to grow
throughout the week until I do another full backup
on the next Sunday.
Now if I do an incremental, it's only that last 24-hour period.
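To make the difference between these backup types concrete, here is a minimal Python sketch, my own illustration, that decides which files each backup type would copy based on when they were last modified, following the Sunday full backup and Monday incremental from the example above. The file names and timestamps are made up.

```python
# Which files does each backup type copy? (illustrative sketch)
from datetime import datetime

files = {
    "report.docx": datetime(2024, 1, 2, 9, 0),    # changed Tuesday
    "budget.xlsx": datetime(2024, 1, 1, 14, 0),   # changed Monday
    "archive.zip": datetime(2023, 12, 20, 8, 0),  # unchanged since before Sunday
}

last_full       = datetime(2023, 12, 31, 23, 0)  # Sunday's full backup
last_any_backup = datetime(2024, 1, 1, 23, 0)    # Monday's incremental

full         = list(files)                                           # everything
incremental  = [f for f, m in files.items() if m > last_any_backup]  # since last backup of any kind
differential = [f for f, m in files.items() if m > last_full]        # since last FULL backup only

print("Full:        ", full)          # all three files
print("Incremental: ", incremental)   # ['report.docx']
print("Differential:", differential)  # ['report.docx', 'budget.xlsx']
```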
Now the last type of backup we have is known as a snapshot.
Now if you're using virtualization
and you're using virtual machines,
this becomes a read only copy of your data frozen in time.
For example, I use snapshots a lot when I'm using virtual
machines or I'm doing malware analysis.
I can take a snapshot of my machine,
which is a frozen instant in time.
Then I can load the malware and do all the bad things
I need to do.
And then once I'm done doing that,
I can restore back to that snapshot which was clean
before I installed all the malware.
This allows me to do dynamic analysis of it.
Now if you have a very large SAN,
or storage area network, array,
you can take snapshots of your servers
and your virtual machines in a very quick and easy way
and then you'll be able to restore them exactly back
to the way they were at any given moment in time.
Now when we use full, incremental and differential,
most of the time those are going to be used with tape backups
and offsite storage.
But if you're going to be doing snapshots,
that's usually done to disk, like a storage area network array.
Now, in addition to conducting your backups of your servers,
it's also important to conduct backups
of your network devices.
This includes their state and their configurations.
The state of a network device contains all the configuration
and dynamic information from a network device
at any given time.
If you export the state of a network device,
it can later be restored to the exact same device
or another device of the same model.
Similarly, you can backup just the configuration information
by conducting a backup of the network device configuration.
This can be done using the command line interface
on the device or using third-party tools.
For example, one organization I worked for
had thousands of network devices.
So we didn't want to go around and do a weekly configuration
backup for all those devices individually.
Instead, we configured them to do that using a tool
known as SolarWinds.
Now once a week, the SolarWinds tool would back up
all the configurations and store them
on a centralized server.
This way, if we ever had a network device that failed,
we could quickly install a spare from our inventory,
restore the configurations from SolarWinds
back to that device, and we would be back online
in just a couple of minutes.
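The course example uses SolarWinds for those scheduled configuration backups. As a hedged illustration of the same idea done with a script instead, here is a rough sketch using the Netmiko library; the hostnames, credentials, device type, and backup path are hypothetical placeholders, not anything from the original organization.

```python
# Illustrative sketch: pull the running config from each device
# and store it on a centralized server, one file per device per run.
from datetime import date
from netmiko import ConnectHandler

devices = ["sw-access-01.example.local", "rtr-core-01.example.local"]  # hypothetical

for host in devices:
    conn = ConnectHandler(
        device_type="cisco_ios",   # assumes Cisco IOS-style devices
        host=host,
        username="backup-svc",     # placeholder credentials
        password="CHANGE_ME",
    )
    running_config = conn.send_command("show running-config")
    conn.disconnect()

    # Store each config centrally so a failed device can be quickly rebuilt.
    with open(f"/backups/configs/{host}-{date.today()}.cfg", "w") as f:
        f.write(running_config)
```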
Facilities support.
In this lesson, we're going to discuss the concept
of facilities and infrastructure support
for our data centers and our recovery sites.
To provide proper facility support,
it's important to consider power, cooling,
and fire suppression.
So we're going to cover uninterrupted power supplies,
power distribution units, generators, HVAC,
and fire suppression systems in this lesson.
First, we have a UPS, or uninterruptible power supply.
Now an uninterruptible power supply,
or uninterruptible power source,
is an electrical apparatus
that provides emergency power to a load
whenever the input power source or main power fails.
Most people think of these as battery backups,
but in our data centers and telecommunication closets,
we usually see devices
that contain more than just a battery backup.
For our purposes, we're going to use a UPS
that is going to provide line conditioning
and protect us from surges and spikes in power.
Our goal in using a UPS
is to make sure that we have clean, reliable power.
Now a UPS is great for short duration power outages,
but they usually don't last more than about 15 to 30 minutes
because they have a relatively short battery life.
The good news is the batteries
are getting better and better every day
and their lives are getting longer and longer
in newer units.
Second, we have power distribution units or PDUs.
Now a power distribution unit
is a device fitted with multiple outputs
designed to distribute electrical power,
especially to racks of computers
and networking equipment located within our data centers.
PDUs can be rack-mounted
or they can take the form of a large cabinet.
In a large data center,
you're usually going to see these large cabinets,
and in general,
there's going to be one PDU for each row of servers
and it contains the high-current circuits,
circuit breakers,
and power monitoring panels inside of them.
These PDUs can provide power protection from surges,
spikes, and brownouts,
but they are not designed
to provide full blackout protection like a UPS would
because they don't have battery backups.
Generally, a PDU will be combined with a UPS or a generator
to provide the power that is needed during a blackout.
Third, we have generators.
Now large generators are usually going to be installed
outside of a data center
in order to provide us with longterm power
during a power outage inside your region.
These generators can be powered by diesel,
gasoline, or propane.
For example, at my office,
I have a 20,000-watt diesel generator
that's used to provide power in case we have a power outage.
Now the big challenge with a generator, though,
is that it takes time to get up to speed
before it's ready to start providing power
to your devices.
This usually takes between 45 and 90 seconds.
So you usually need to pair them up
with a battery backup or UPS
as you're designing your power redundancy solution.
For example, at my office, if the power goes out,
the UPS will carry the load for up to 15 minutes.
During that time,
the generator will automatically be brought online,
usually taking 45 to 90 seconds.
Once that generator is fully online,
and providing the right stable power,
and it's ready to take the load,
the power gets shifted
from the UPS batteries to the generator,
using an automatic transfer switch or ATS.
Now once the power has been restored in our area
and has stayed steady for at least five minutes,
then our ATS will actually shift power back to the grid
through our UPS unit, that battery backup,
and then shut down our generator.
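Here is a small, simplified sketch, my own illustration, of that failover sequence laid out as a timeline using the rough timing figures mentioned in this lesson; a real automatic transfer switch and generator controller are more involved than this.

```python
# Illustrative timeline of a UPS -> generator -> grid failover sequence.

def power_event_timeline(generator_start_s=60, utility_restored_after_min=42):
    events = []
    events.append((0, "Utility power fails; UPS batteries carry the load instantly"))
    events.append((generator_start_s,
                   "Generator is up to speed; ATS transfers the load from UPS to generator"))
    restored = utility_restored_after_min * 60
    events.append((restored, "Utility power returns; ATS waits for it to stay stable"))
    events.append((restored + 5 * 60,
                   "Utility stable for 5 minutes; ATS shifts the load back to the grid "
                   "and the generator shuts down"))
    return events

for t, description in power_event_timeline():
    print(f"t = {t:>5} s: {description}")
```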
Fourth, we have HVAC units.
HVAC stands for heating, ventilation, and air conditioning.
Our data centers
are going to generate a ton of heat inside of them
because of all these servers, and switches,
and routers, and firewalls,
that are doing processing inside of them.
To cool down these devices,
we need to have a good HVAC system.
Now to help with this cooling,
most data centers are going to utilize
a hot and cold aisle concept.
Now in the simplest form,
each row of servers is going to face another row of servers.
These two rows
will have the front of the servers facing each other
and the rear of the servers facing away from the aisle.
This is because the servers are designed
to push air out the rear of the device.
So the front of the servers is in the cold aisle
and the rear of the servers is in the hot aisle.
This lets us focus our HVAC systems into the hot aisles
to suck that hot air out,
cool it down, and return it back to the cold aisle,
where it can then be circulated over the servers once again.
Remember, proper cooling is important to the health
and security of our networks and our devices.
If the network devices start to overheat,
they will shut themselves down
to protect their critical components,
and if those components get overheated for too long,
permanent damage can occur
or it can decrease the life expectancy of those devices.
Now our fifth and final thing we need to discuss
is fire suppression.
In a data center,
we usually have built-in fire suppression systems.
These can include wet pipe sprinklers,
pre-action sprinklers, and special suppression systems.
Now a wet pipe system is the most basic type
of fire suppression system,
and it involves a sprinkler system with pipes
that always contain water.
Now in a server room or data center environment,
this is kind of dangerous
because a leak in that pipe could damage your servers
that are sitting underneath them.
In general, you should avoid using a wet pipe system
in and around your data centers.
Instead, if you're going to use a water-based system,
you should use a pre-action system
to minimize the risk of accidental release.
With a pre-action system,
a detector actuator, which works like a smoke detector,
has to trip first,
and then a sprinkler head also has to be tripped
before any water is released.
Again, using water in a data center,
even in a pre-action system,
is not really a good idea though, so I try to avoid it.
Instead, I like to rely on special suppression systems
for most of my data centers.
This will use something like a clean agent system.
Now a clean agent is something like a halocarbon agent
or an inert gas,
which, when released, will displace the oxygen
in the room
and essentially suffocate the fire.
Now, the danger with using
a special suppressant system like this
is that if there are people working in your data center,
those people can suffocate
when the clean agent is released.
So your data center needs to be equipped with an alarm
that announces when the clean agent is being released,
and you also need to make sure
there's supplemental oxygen masks available
and easily accessible
by any person who's working in that data center
whenever they hear the alarm go off
for that clean agent release.
So remember, when you're designing your data centers
and your primary work environment or your recovery sites,
you need to consider your power,
your cooling, and your fire suppression needs.
Why do we need quality of service or QoS?
Well, nowadays we operate converged networks,
which means all of our networks are carrying voice, data
and video content over the same wire.
We don't have them all separated out like we used to.
We used to have networks for phones and ones for data
and ones for video,
but now everything's riding over the same IP networks.
So, because of this convergence of mediums,
these networks now need
a high level of availability
to ensure proper delivery
over all of these different mediums,
because we want the phone to work
every time we pick it up, right?
Well, by using QoS, we can optimize our network
to efficiently utilize all the bandwidth at the right time
to deliver the right service to our users
and give us both success and cost savings.
Now, we want to have an excellent quality of service,
an excellent service for our customers,
and that's what we're going to start doing by using QoS.
So what exactly is QoS?
Well, quality of service enables us
to strategically optimize our network performance
based on different types of traffic.
Previously, we talked about the fact
that we want to categorize our different traffic types.
I might have web traffic and voice traffic and video traffic
and email traffic.
And by categorizing it
and identifying these different types of traffic,
I can then prioritize that traffic and route it differently.
So I might determine how much bandwidth is required
for each of those types of traffic.
And I can efficiently use my wide area network links
and all the bandwidth available for maximum utilization,
saving me bandwidth costs over time.
This can help me identify
the types of traffic that I should drop
whenever there's going to be some kind of congestion,
because if you look at the average load,
there's always going to be some peaks and some valleys.
And so we want to be able to figure that out.
So for example, here on the screen,
you can see the peaks and the valleys
in the traffic over time,
and we need to be able to categorize things
to fit within our bandwidth limitations.
So for example, if we have things like VoIP,
or voice over IP, or video service,
they need to have a higher priority,
because if I'm talking to you on a phone,
I don't want a high amount of latency.
For checking my bank balance, for instance, though,
I can wait another half a second for the web page to load.
But when I'm listening to you talk, that half-second delay
starts sounding like an echo,
and it gives me a horrible service level.
So we want to be able to solve that,
and to do that, we use quality of service.
Now there are different categories of quality of service.
There are three big ones known as delay, jitter and drops.
When I talk about delay,
this happens when you look at the time
that a packet travels from the source to the destination,
this is measured in milliseconds,
and it's not a big deal if you're dealing with data traffic,
but if you're dealing with voice or video,
delay is an especially big deal,
particularly if you're doing things live,
like talking on the phone or doing a live stream,
or something like that.
Now, jitter is an uneven arrival of packets,
and this is especially bad in Voice over IP traffic,
because you're using something like UDP.
And so if I say something to you, like, "my name is Jason,"
and you got "Jason my name is,"
it sounds kind of weird, right?
Now, usually it's not big chunks like that,
but instead it's little bits,
and you'll hear these little clicks and pops
that jumble things up because of that jitter.
And this really sounds bad, and it's a bad user experience
if you're using Voice over IP.
And so jitter is a really bad thing
when you're dealing with voice and video.
Now, the third thing we have is what's known as a drop.
Drops are going to occur during network congestion.
When the network becomes too congested,
the router simply can't keep up with demand,
and the queue starts overflowing,
and it'll start dropping packets.
This results in packet loss,
and if you're using TCP, it'll just send it again.
But again, if I'm dealing with VoIP, VoIP is usually UDP.
And so if we're talking
and all of a sudden my voice cuts out like that,
that would be bad too.
That's why we don't want to have packet drops on a VoIP call.
And so we want to make sure that that doesn't happen.
These network drops are something that can be avoided
by doing the proper quality of service as well.
So when we deal with this,
we have to think about effective bandwidth.
What is our effective bandwidth?
This is an important concept.
So let's look at this client and this server.
There's probably a lot more to this network
than what I'm showing you here on the screen,
but I've simplified it down for this example.
Here, you can see I have my client on the left,
and he wants to talk to the server.
So he goes up through the switch,
which uses 100 megabit per second Cat-5 cable.
Then he goes through a WAN link
over a 256 kilobit per second connection
because he's using an old DSL line.
Then that connects from that ISP over a T1 connection
to another router.
That router connects to an E1 connection to another router.
And from that router, it goes down a WAN link
over a 512 kilobit per second connection,
and then down to a switch with a gigabit connection,
down to the server.
Now, what is my effective bandwidth?
Well, it's 256 kilobits per second,
because no matter how fast any of the other links are,
whatever the lowest link is inside of this connection,
that is going to be your effective bandwidth.
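Since the effective bandwidth of a path is simply its slowest link, you can sanity-check it with a few lines of code. Here is a quick Python sketch, my own illustration, using the link speeds from this example.

```python
# The effective bandwidth of a path is the minimum link speed along it.
links_kbps = {
    "client -> switch (Cat 5)":   100_000,    # 100 Mbps
    "switch -> DSL WAN link":     256,        # 256 kbps
    "DSL ISP -> router (T1)":     1_544,      # 1.544 Mbps
    "router -> router (E1)":      2_048,      # 2.048 Mbps
    "router -> WAN link":         512,        # 512 kbps
    "switch -> server (gigabit)": 1_000_000,  # 1 Gbps
}

bottleneck = min(links_kbps, key=links_kbps.get)
print(f"Effective bandwidth: {links_kbps[bottleneck]} kbps, limited by {bottleneck}")
```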
So we've talked about quality of service categories.
In our next lesson, we're going to be talking about
how we can alleviate this problem
of effective bandwidth and try to get more out of it,
because we need to be able
to increase our available bandwidth, but in this example,
we're limited to 256 kilobits per second,
which is going to be really, really slow for us.
Now, I like to think about effective bandwidth
like water flowing through pipes.
I can have big pipes and I can have little pipes.
And if I have little pipes,
I'm going to get less water per second through it
than if I have a really big pipe.
And so this is the idea, if you think about a big funnel,
it can start to back up on us, right?
That's the concept,
and we have to figure out how we can fix that
by using quality of service effectively,
which we're going to discuss more in the next video.
When we deal with the quality of service categorization,
we first have to ask,
what is the purpose of quality of service?
Now, the purpose of quality of service is all about
categorizing your traffic and putting it into buckets
so we can apply a policy to certain buckets
based on those traffic categories
and then we can prioritize them based on that.
I like to tell stories and use analogies in my classes
to help drive home points.
And so, since we're talking about
quality of service and traffic,
I think it's important to talk about real-world traffic.
I live in the Baltimore, Washington D.C area.
This area is known for having
some really really bad traffic.
Now, to alleviate this they applied the idea
of quality of service to their traffic system.
They have three different categories of cars.
They have the first car, which is general public.
Anybody who gets in the road and starts driving,
they are part of this group.
Then there's another category
called high occupancy vehicles or HOV.
And so, if I'm driving my car
and I have at least two other passengers with me,
I can get into special HOV only lanes
and I can go a little bit faster.
Now the third bucket is toll roads or pay roads.
And you have to pay to get on these roads.
And based on the time of day
and the amount of traffic there is,
they actually increase or decrease the price.
Now, if it's during rush hour, you might pay $5 or $10
to get in one of those special toll lanes.
But, they're going to travel a whole lot faster
than the regular general commuter lanes or those HOV lanes.
Now, what does this mean in terms of quality of service?
Well, it's really the same thing.
We take our traffic and we go, okay, this is web traffic,
and this is email traffic,
and this is voice or video traffic.
And based on those buckets we assign a priority to them.
And we let certain traffic go first
and we let it get there faster.
Now, when we categorize this traffic
we start to determine our network performance based on it.
We can start figuring out the requirements
based on the different traffic types
and whether it's voice or video or data.
If we're starting to deal with voice or video
because there are things like streaming media
especially in real-time like a Skype call
or a Voice over IP service,
I want to have a very low delay
and therefore a higher priority.
This way I can do this stuff
for streaming media and voice services
and prevent those jitters and drops and things like that
that we talked about before.
Now, this is something that I want to make sure
has a good high priority so I can get it through.
Instead, if I have something with a low priority,
that might be something like web browsing
or non-mission-critical data.
For instance, if my employees are surfing on Facebook,
that would be a very low priority.
Or if I deal with email,
email is generally a low priority
when it comes to quality of service.
Now why is that, isn't email important to you?
Well, because most email is done
as a store and forward communication method.
This means when I send email,
it can sit on my server for 5 or 10 minutes
before it's actually sent out to the end-user
and they'll never realize it.
So that's okay.
It can be a low priority, it'll still get there eventually.
But if I did the same thing with VoIP traffic,
even delaying it by half a second or a second,
you're going to hear jitters and bumps and echoes
and that would be a horrible service.
So, we want to make sure you get high quality of service
for VoIP and lower priority for email.
Now, that's just the way we have it set up.
You can have it set up however you want;
as long as you understand
what your quality of service policy is,
and your users understand it too,
this is going to be okay.
The best way to do that is to document it
and share that with your users.
You want to make sure your users understand your policy
because this will help make sure
that they don't have problems
and start reporting that back to your service desk.
You can do this by posting it to your internal website.
You might post it as part of your indoctrination paperwork,
or use whatever method you want.
You want to make sure those users understand it
because they're the ones who are going to be there
surfing Facebook or watching YouTube.
If you've categorized as a low priority,
they're going to think something's broken.
But if they know it's a low priority,
they understand it's not broken
it's just your corporate policy.
Now, if they're going to be surfing
something on the web that's mission critical,
that's a higher priority and it's going to get
preferential treatment with your quality of service,
they should know that too.
This is the idea here.
We have to make sure that they understand
how we categorize our traffic
and what categories those get put into.
Now, what are some ways that we can categorize our traffic?
Well, there's really three different mechanisms you can use.
We have best effort, integrated services,
and differentiated services.
Now, when we use best effort
this is when we don't have any quality of service at all
and so traffic is just first in, first out,
every man for himself.
We're going to do our best and just try to get it there.
There's really no reordering of packets.
There's no shaping.
It's just pretty much no quality of service.
First in, first out, best effort.
The second type is known as integrated services or IntServ.
This is also known as hard QoS.
There are different names for it
depending on what company you're using
and what routers and switches you're using.
But the idea here is,
we're going to make strict bandwidth reservations.
We might say that all web traffic
is going to get 50% of our bandwidth,
VoIP service is going to get 25%,
and video service is going to get the remaining 25%.
Now, by reserving bandwidth
for each of these traffic types,
we decide how much is going to be there
for each of those three categories.
Now, when we do a DiffServ or differentiated services,
also known as soft QoS,
those percentages become more of a suggestion.
There's going to be this differentiation
between different data types
but for each of these packets,
it's going to be marked its own way.
The routers and switches can then make decisions
based on those markings
and they can fluctuate traffic a little bit as they need to.
Now, this is referred to as soft QoS
because even though we set web up at maybe 50%,
if there's not as much web browsing going on right now,
we can actually take away some of that 50%
and give it over to VoIP, increasing that from 25% to 35%.
This way, when somebody wants to browse the web,
we can then take back that extra from VoIP
and give it back to web, back to that 50% it originally had,
based on those markings and based on those categories.
Now, if we were using hard QoS or that integrated services,
even if we allocate 50% for web browsing
and nobody's using web browsing,
we're still going to have 50% sitting there
waiting to serve people for web browsing.
And that's why a lot of companies prefer to use soft QoS.
Now, let's take a look at it like this
because I like to use simple charts and graphs
to try to make it easy to understand.
With best effort at the top,
you have no strict policies at all.
And basically, you just make your best effort
at providing everyone a good quality of service.
Now with DiffServ you have less strict policies,
also known as soft QoS.
Now it's better than the best effort approach
but it's still not the most efficient
or effective method of providing a good quality of service
to those who really need it.
Now with IntServ approaches
you're going to have more of a hard QoS limit.
This is what we've talked about before.
Now, this is going to give you the highest level of service
to those covered by its strict policies.
And if you need a really strong quality of service level,
then IntServ, or hard QoS with its strict policies,
can really ensure that you get it.
Now, the way I like to look at this
is as bundles of QoS options that we can choose from.
So which of these bundles is really the best?
Well, it depends.
It depends on your network and it depends on your needs.
But most of the time, it's not going to be a best effort
because that's usually going to give you
not as much quality as you're really going to want here.
Now, when we start categorizing our traffic out there
we're going to start using these different mechanisms,
either soft or hard QoS, for doing that.
And we can do that using classification and marking.
We can do it through congestion management
and congestion avoidance.
We can use policing and shaping.
And we can also use link efficiency.
All of these choices fall under a soft QoS or hard QoS
depending on your configuration that you've set up
within your network appliances, firewalls, or routers.
As I mentioned before,
we have different ways of categorizing our traffic.
We can do it through classification, marking,
utilizing congestion management, congestion avoidance,
policing and shaping, and link efficiency.
All of these are ways for us to help implement
our quality of service and take us from this to this.
Now, as you can see,
we want to start shaping out those peaks and valleys
using these different mechanisms
to give us a better quality of service.
Now, when we look at the classification of traffic,
traffic is going to be placed
into these different categories.
Now, this is going to be done
based on the type of traffic that it is.
There's email, but even inside of email,
we have many different classes
of information inside of an email.
If you think about email,
we have POP3 traffic, we have IMAP traffic.
We have SMTP traffic. We have Exchange traffic.
Those are four different types right there.
And so we can look at the headers
and we can look at the packet type of information
and we can even use the ports that are being used.
And then we can determine what services
need higher or less priority.
We can then do this, not just across email,
but across all of our traffic.
And by doing this, this classification
doesn't alter any bits in the frame itself or the packet.
Instead, there is no marking inside of there.
It's all based on the analysis of the packet itself,
the ports and the protocols used,
and our switches and routers are going to implement QoS
based on that information.
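Here is a rough Python sketch, my own illustration, of that kind of port-based classification: the packet itself is not altered, we simply look at the destination port and assign a traffic class and priority. The port-to-class mapping and the priority numbers are illustrative assumptions.

```python
# Classify traffic by destination port without modifying the packet.
PORT_CLASSES = {
    5060: "voice",  # SIP signaling
    443:  "web",    # HTTPS
    80:   "web",    # HTTP
    25:   "email",  # SMTP
    143:  "email",  # IMAP
    110:  "email",  # POP3
}

PRIORITY = {"voice": 1, "video": 2, "web": 3, "email": 4}  # 1 = highest priority

def classify(dst_port: int) -> str:
    return PORT_CLASSES.get(dst_port, "best-effort")

print(classify(143), PRIORITY.get(classify(143)))    # email 4
print(classify(5060), PRIORITY.get(classify(5060)))  # voice 1
```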
Now, another way to do this, is by marking that traffic.
With this, we're going to alter the bits within the frame.
Now we can do this inside frames, cells, or packets,
depending on what networks we're using.
And this will indicate how we handle this piece of traffic.
Our network tools are going to make decisions
based on those markings.
If you look at the type of service header,
it's going to have a byte of information, or eight bits.
The first three bits of that byte are the IP Precedence.
Under the newer scheme, the first six bits are used
as the Differentiated Services Code Point, or DSCP.
Now you don't need to memorize
how this type of service is done inside the header.
But I do want you to remember one of the ways
that we can do this quality of service
is by marking and altering that traffic.
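As a hedged illustration of marking, here is a small Python sketch that sets the ToS/DSCP byte on an outgoing UDP socket so downstream devices can prioritize it. It assumes a Linux-style socket that honors IP_TOS, and the DSCP value, destination address, and port are illustrative placeholders.

```python
# Mark outgoing traffic by setting the ToS/DSCP byte on the socket.
import socket

EF_DSCP = 46              # Expedited Forwarding, commonly used for voice
tos_byte = EF_DSCP << 2   # DSCP occupies the upper six bits of the ToS byte

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)   # e.g., a VoIP-style UDP socket
sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, tos_byte)
sock.sendto(b"voice payload", ("203.0.113.10", 5004))      # placeholder address/port
```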
Next, we have congestion management.
And when a device receives traffic
faster than it can be transmitted,
it's going to end up buffering that extra traffic
until bandwidth becomes available.
This is known as queuing.
The queuing algorithm is going to empty the queues
in a specified sequence and amount.
These algorithms are going to use one of three mechanisms.
There is weighted fair queuing,
there's low-latency queuing,
or there is weighted round-robin.
Now let's look at this example I have here.
I have four categories of traffic:
Traffic 1, 2, 3, and 4.
It really doesn't matter what kind of traffic it is,
for our example right now,
we just need to know that there's four categories.
Now, if we're going to be using a weighted fair queuing,
how are we going to start emptying these piles of traffic?
Well, I'm going to take one from 1, one from 2,
one from 3, and one from 4.
Then I'm going to go back to 1 and 2 and 3 and 4.
And we'll just keep taking turns.
Now, is that a good mechanism?
Well, maybe. It depends on what your traffic is.
If column 1, for example, was representing VoIP traffic,
this actually isn't a very good mechanism,
because it keeps making that traffic wait for its turn.
So instead, let's look at this low-latency queuing instead.
Based on our categories of 1, 2, 3, and 4,
we're going to assign priorities to them.
If 1 was a higher priority than 2,
then all of 1 would get emptied,
then all of 2 would get emptied,
and then all of 3, and then all of 4.
Now this works well to prioritize things like
voice and video.
But if you're sitting in category 3 or 4,
you might start receiving
a lot of timeouts and dropped packets
because it's never going to be your turn.
And you're just going to wait and wait and wait.
Now the next one we have is called the weighted round-robin.
And this is actually one of my favorites.
This is kind of a hybrid between the other two.
Now with a weighted round-robin,
we might say that category 1 is VoIP,
and category 2 is video, category 3 is web,
and category 4 is email.
And so we might say that in the priority order,
1 is going to be highest
and we're going to use a weighted round-robin,
and we might say, we're going to take three
out of category 1, two out of category 2,
and then one out of 3 and one out of 4.
And we'll keep going around that way.
We'll take three, two, one, one, three, two, one, one.
And we keep going.
That way, VoIP traffic is getting a lot of priority.
Video is getting the second highest priority.
And then we start looking at web and email
at the bottom of the barrel,
but they're still getting a turn
every couple of rounds here.
And so that way it becomes a weighted round-robin.
As I said, this is the quality of service mechanism
that I really like to implement inside my own networks.
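Here is a small Python sketch, my own illustration, of that weighted round-robin idea: each pass around the queues drains three VoIP packets, two video, one web, and one email, so the high-priority queues move fastest but every queue still gets a turn.

```python
# Weighted round-robin emptying of four traffic queues (illustrative).
from collections import deque

queues = {
    "voip":  deque(f"voip-{i}"  for i in range(6)),
    "video": deque(f"video-{i}" for i in range(4)),
    "web":   deque(f"web-{i}"   for i in range(4)),
    "email": deque(f"email-{i}" for i in range(4)),
}
weights = {"voip": 3, "video": 2, "web": 1, "email": 1}  # packets drained per pass

sent = []
while any(queues.values()):
    for name, weight in weights.items():
        for _ in range(weight):
            if queues[name]:
                sent.append(queues[name].popleft())

print(sent)  # VoIP dominates each round, but every queue still gets serviced
```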
Next, we have the idea of congestion avoidance.
As new packets keep arriving, they can be discarded
if the output queue is already filled up.
Now, I like to think about this as a bucket.
As you can see here, I have a cylinder on the bottom
and it has a minimum and a maximum.
Now, if it's already at maximum and you try
to put more into the bucket,
it just overflows over the top.
Now to help prevent this, we have what's called
the RED or random early detection.
This is used to prevent this overflow from happening for us.
As the queue starts approaching that maximum,
we have this possibility
that discard is going to happen.
And so what we start doing is dropping traffic.
Instead of just dropping traffic randomly,
we're going to drop it based on priority,
with the lowest traffic priority getting dropped first.
RED is going to drop packets from the selected queues
based on their defined limits.
Now I might start dropping TCP traffic first
because I know it'll retransmit itself.
Whereas with UDP, if you drop it, it's gone forever.
And so I might keep that in my queue a little bit longer,
so it doesn't get dropped.
Now, that's the idea here with TCP traffic,
even if I drop it, we're going to get that retransmission
and we'll try again.
But with UDP, if it's dropped,
you're never going to know about it,
and you're going to have loss of service.
Now, when you're dealing with congestion avoidance,
we're going to try to use the buffer
to our advantage, and be able to use it to help us
get more bandwidth through.
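Here is a rough Python sketch, my own illustration, of the random early detection idea: below a minimum queue depth nothing is dropped, at or above a maximum everything new is dropped, and in between the drop probability ramps up. The thresholds and the maximum drop probability are illustrative assumptions.

```python
# Random early detection (RED) drop decision, simplified for illustration.
import random

MIN_THRESHOLD = 20    # below this queue depth, never drop
MAX_THRESHOLD = 50    # at or above this depth, drop every new arrival
MAX_DROP_PROB = 0.10  # drop probability as the queue nears the maximum

def should_drop(queue_depth: int) -> bool:
    if queue_depth < MIN_THRESHOLD:
        return False
    if queue_depth >= MAX_THRESHOLD:
        return True
    # Linearly ramp the drop probability between the two thresholds.
    fraction = (queue_depth - MIN_THRESHOLD) / (MAX_THRESHOLD - MIN_THRESHOLD)
    return random.random() < fraction * MAX_DROP_PROB

print(should_drop(10))  # False: queue is comfortably below the minimum
print(should_drop(60))  # True: queue is past the maximum, drop new arrivals
```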
Now, when we start putting all these things together,
we start getting into these two concepts,
known as policing and shaping.
Policing is going to discard packets
that exceed the configured rate limit,
which we like to refer to as our speed limit.
Just like if you're driving down the highway too fast,
you're going to get pulled over by a cop
and you're going to get a ticket.
That's what policing is going to do for us.
Now, we're just going to go and drop you off the network
anytime you're going too fast.
So, dropped packets are going to result in retransmissions,
which then consume even more bandwidth.
Therefore, policing is only good
for very high-speed interfaces.
If you're using a dial up modem or an ISDN connection,
or even a T1, you probably don't want to use policing.
You're much better off using our second method,
which is known as shaping.
Now, what shaping is going to do for us
is it's going to allow the buffer
to delay traffic from exceeding the configured rate.
Instead of dropping those packets like we did in policing,
we're just going to hold them in our buffer.
Then when the link is less busy and there's space available,
we're going to start pushing that buffered traffic
through the empty space and start shaping out the packets.
This is why we call it shaping or packet shaping.
Now you can see what this looks like here on the screen.
I have traffic at the top,
and you'll see all those jagged lines going down.
Now, what really happens here in your network
is there's this high period of time,
and there's low periods of time,
because not everything is happening
all the time in an equal amount.
If we do policing, all we did was chop off the tops,
which gave us more retransmissions.
With shaping, instead, we're going to start filling
in from the bottom, from our queue,
so it keeps traffic right up towards the speed limit
without going over it.
Again, shaping does a better job
of maximizing your bandwidth,
especially on slow speed interfaces,
like a T1 connection, a dial up,
satellite connections, or ISDN.
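To contrast the two behaviors, here is a simplified Python sketch, my own illustration, where a policer drops everything over the configured rate while a shaper buffers the excess and sends it during quieter intervals. The rate limit and traffic pattern are illustrative.

```python
# Policing vs shaping of bursty traffic (illustrative sketch).
from collections import deque

RATE_LIMIT = 3  # packets allowed per time interval

def police(arrivals_per_interval):
    """Drop anything above the rate limit in each interval."""
    return [min(n, RATE_LIMIT) for n in arrivals_per_interval]

def shape(arrivals_per_interval):
    """Buffer the excess and send it during later, quieter intervals."""
    buffer = deque()
    sent = []
    for n in arrivals_per_interval:
        buffer.extend(range(n))
        sent.append(min(len(buffer), RATE_LIMIT))
        for _ in range(sent[-1]):
            buffer.popleft()
    return sent

bursty_traffic = [5, 0, 4, 1, 0]
print(police(bursty_traffic))  # [3, 0, 3, 1, 0] -> 7 packets delivered, rest dropped
print(shape(bursty_traffic))   # [3, 2, 3, 2, 0] -> all 10 delivered, just a bit later
```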
Then the last thing we need to talk about here
is link efficiency.
Now there's a couple of things we need to mention
in regard to link efficiency.
The first of which is compression.
To get the most out of your link,
you want to make it as efficient as possible.
And so to do that, we can compress our packets.
If we take our payloads and compress them down,
that's going to conserve bandwidth
because there are fewer ones and zeros
that need to go across the wire.
VoIP is a great thing that you can compress
because there's so much extra space
that's wasted inside of voice traffic.
VoIP payloads can actually be reduced
by up to 50% of their original size.
We could take it down from 40 bytes
down to 20 bytes by using compression.
If you think that's good, look at the VoIP header.
I can compress the VoIP header down
by 90 to 95% of its original size.
I can take it from 40 bytes down to just two to four bytes.
To do this, we use something called compressed RTP or cRTP.
Now, when I have the original VoIP packet,
as you can see here, I have an IP header,
I have a UDP header,
and I have an RTP header.
And then I have my voice payload.
I can compress all of that down into just a cRTP,
which consolidates the IP, the UDP,
and the RTP altogether into one.
The voice payload can also be squeezed down
to about half of its size.
Now, you're not going to notice a big difference
in your audio quality by doing this either,
and it can be utilized on slower speed links
to make the most of your limited bandwidth.
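Here is a quick back-of-the-envelope calculation in Python using the figures from this lesson: a 40-byte IP/UDP/RTP header compressed to roughly 2 bytes with cRTP, and a voice payload squeezed to about half its size. The payload size is illustrative.

```python
# Rough per-packet savings from cRTP header compression plus payload compression.
uncompressed_header = 40   # bytes: IP + UDP + RTP combined
crtp_header         = 2    # bytes (2 to 4 with cRTP)
voice_payload       = 40   # bytes, illustrative
compressed_payload  = voice_payload // 2

before = uncompressed_header + voice_payload
after  = crtp_header + compressed_payload
print(f"{before} bytes per packet before, {after} bytes after")
print(f"That's about {round((1 - after / before) * 100)}% smaller")  # ~72% smaller
```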
And it's not just for VoIP.
You can do this with other types of data too.
Compression is a great thing to use.
There are devices out there called WAN accelerators
that focus specifically on compressing your data
before sending it out over your WAN link.
The last thing I want to talk about here
is what we call LFI, which is another method
to make more efficient use of your links.
This is known as link fragmentation and interleaving.
Now what this does is if you have a really big packet,
it'll start chopping those up
and take those big packets and fragment them,
and then interleave smaller packets in between them.
This way, it's going to allow you to utilize
those slower speed links to make the most
of your limited bandwidth.
Notice here I have three voice packets,
and one big chunk of data.
Now what the router would do is chop up that data
and send one small voice piece,
then one small data piece,
then another small voice piece,
and another small data piece.
That way, the voice doesn't suffer
from huge latency by waiting for that big piece
of data to go through first.
By doing this fragmentation and interleaving,
it allows you to get some of that high priority traffic out
in between those larger data structures as well.
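Here is a small Python sketch, my own illustration, of that fragmentation and interleaving: a large data packet is chopped into fragments and the small voice packets are slotted in between them, so the voice never has to wait behind the whole data blob.

```python
# Link fragmentation and interleaving (LFI), simplified for illustration.

def fragment(packet: bytes, size: int):
    """Chop a large packet into fixed-size fragments."""
    return [packet[i:i + size] for i in range(0, len(packet), size)]

voice_packets = [b"v1", b"v2", b"v3"]
big_data      = b"D" * 12
data_frags    = fragment(big_data, 4)   # -> [b'DDDD', b'DDDD', b'DDDD']

interleaved = []
for voice, frag in zip(voice_packets, data_frags):
    interleaved.extend([voice, frag])   # voice, data fragment, voice, ...

print(interleaved)  # [b'v1', b'DDDD', b'v2', b'DDDD', b'v3', b'DDDD']
```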