
Network Availability

Network availability.

In this section of the course,

we're going to talk about network availability.

Now, network availability is a measure

of how well a computer network can respond to connectivity

and performance demands that are being placed upon it.

This is usually going to be quantitatively measured

as uptime, where we count the amount of time

the network was up,

and divide that by the total amount of time

covered by the monitoring period.

For example, if I monitor the network for a full year

and I only had five minutes and 16 seconds of downtime

during that year, this would equate to an uptime of 99.999%.

Now this is known as the five nines of availability.

And it's considered the gold standard

in network availability.

This is considered to be an extremely available

and high quality network,

but there is going to be downtime in your networks.

It's a fact of life, devices fail, connections go down,

and incorrect configurations are sometimes going to be applied.

There are a lot of reasons that downtime occurs,

but our goal is to minimize that downtime

and increase our availability.

In order to reach the highest levels of availability,

we need to build in availability and redundancy

into our networks from the beginning.

We also are going to use quality of service

to ensure that our end-users are happy with the networks

and the services that we're providing them.

So in this section on network availability,

we're really going to focus on two domains.

We're going to see domain two,

which is network implementations,

and domain three, network operations.

In here, we're going to talk about two objectives.

Objective 2.2 and objective 3.3.

Now objective 2.2 states, that you must compare and contrast

routing technologies and bandwidth management concepts.

Objective 3.3 states,

that you must explain high availability

and disaster recovery concepts

and summarize which is the best solution.

So let's get started talking

all about the different ways for us

to increase the availability, reliability,

and quality of service within our networks

in this section of the course.

High availability.

In this lesson,

we're going to talk all about high availability.

Now, when we're talking about high availability,

we're really talking about making sure our systems are up

and available.

Availability is going to be measured in what we call uptime

or how many minutes or hours you're up and available

as shown as a percentage.

Usually, you're going to take the amount of minutes

you were up,

divided by the total amount of minutes in a period,

and that gives you a percentage known as uptime.

Now, we try to maintain what is known as the five nines

of availability in most commercial networks.

This is actually really hard because that's 99.999%.

That means I get a maximum of about five minutes

of downtime per year,

which is not a whole lot of downtime.

In some cloud based networks,

they aim for six nines of availability, or 99.9999%.

This equates to just 31 seconds of downtime

each and every year.
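
To make that math concrete, here's a minimal Python sketch (my own illustration, not part of the course) that converts a yearly downtime figure into an uptime percentage, and shows the maximum downtime allowed for a given number of nines.

```python
# Illustrative sketch: uptime percentage and allowed downtime per year.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

def uptime_percent(downtime_minutes: float) -> float:
    """Uptime = (total time - downtime) / total time, as a percentage."""
    return (MINUTES_PER_YEAR - downtime_minutes) / MINUTES_PER_YEAR * 100

def max_downtime_minutes(nines: int) -> float:
    """Maximum yearly downtime allowed for a given number of nines."""
    availability = 1 - 10 ** (-nines)          # e.g. 5 nines -> 0.99999
    return MINUTES_PER_YEAR * (1 - availability)

print(round(uptime_percent(5 + 16 / 60), 3))        # ~99.999 (five nines)
print(round(max_downtime_minutes(6) * 60, 1))       # ~31.5 seconds (six nines)
```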

Now, as you can imagine,

I need more than 31 seconds of downtime,

or even five minutes of downtime,

if I want to fully patch my servers and install a new hard drive,

or put in a new router or switch when one fails.

So, how do I maintain that high level of availability?

Well, I'm going to do that,

by designing my networks to be highly available.

Now, there are two terms you need to understand

and be familiar with,

when we talk about high availability.

There is availability and reliability,

and these are different things.

When I'm talking about availability,

this is concerned with being up and operational.

When I talk about reliability,

I'm concerned with not dropping packets

inside of my network.

If your network is highly available,

but it's not reliable,

it's not a very good network

because it's dropping things all the time

and isn't doing what it's supposed to.

But conversely,

you can have a really highly reliable network,

but if it's not a highly available one,

nobody can use it either because it's down all the time.

So that wouldn't be good either.

So, let's say I had the most highly reliable network

in the entire world,

but it's only up 20 minutes a year.

That's not going to be any good, right?

So, we want to make sure we balance these two things.

We have to aim for good enough in both areas

to meet our business needs based on the available resources

and the amount of money we have to build our networks.

So, when we measure our different network components,

we have to determine how highly available they are.

And we do that through measurement of MTTR and MTBF.

Now, MTTR is the mean time to repair.

This measure is the average time it takes to repair

a network device when it breaks.

After all,

everything is going to break eventually.

So, when a device breaks,

how long does it take for you to fix it?

And how much downtime are you going to experience?

That is what we're trying to measure

when we deal with the mean time to repair.

Now, the mean time between failures or MTBF,

is going to measure the average time

between when a failure happens on a device

and the next failure happens.

Now, these two terms can often be confusing.

So, let me display it on a timeline and explain a little bit

about what they look like in the real world.

Now, let's say I had a system failure

at this first stop sign here on the left side.

Then we resume normal operations because we fix things.

That amount of time, was the time to repair.

Now, if I average all the times to repair

over the entire year for that type of device,

that's going to give me my MTTR,

my mean time to repair, the average time to repair.

Now, on the failure side of things,

we want to measure the time from one failure,

through using it and fixing it,

until the next failure happens.

This becomes the time between the failures.

If I average all those together,

I get the average time between failures

or the mean time between failures, MTBF.

Hopefully, you can see the difference here.

Remember, when we're dealing with mean time to repair,

we want this to be a very small number.

When we deal with the mean time between failures,

we want this to be a very large number.

With a very small mean time to repair,

this means we can fix things really quickly

and get ourselves back online.

So, the lower the mean time to repair is,

the better the network availability.

Now, on the other hand,

we start talking about mean time between failures,

we want a really long time

because this means that the device has stayed up

and operational for a very long time before it fails.

This is going to give us better network availability,

and overall, it should give us better reliability too.

Now, we don't want a lot of failures here.

And so the more time in between failures,

the better that is for our network.
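
Here's a small illustrative Python sketch, using made-up incident times, showing how MTTR and MTBF could be computed from a log of failures and repairs.

```python
# Illustrative sketch with made-up hours: each tuple is (failure_time, repair_time).
incidents = [(100.0, 101.5), (400.0, 400.5), (900.0, 903.0)]

# MTTR: average of (repair_time - failure_time) across all incidents.
mttr = sum(repair - fail for fail, repair in incidents) / len(incidents)

# MTBF: average time from one failure to the next failure.
failure_times = [fail for fail, _ in incidents]
gaps = [b - a for a, b in zip(failure_times, failure_times[1:])]
mtbf = sum(gaps) / len(gaps)

print(f"MTTR: {mttr:.2f} hours")   # small is better
print(f"MTBF: {mtbf:.2f} hours")   # large is better
```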

So, how do we design these networks

to be highly reliable and highly available?

Well, we're going to add redundancy to our networks

and their devices.

Now, redundancy can be achieved through a single device

or by using multiple devices.

If you're using a single device,

you're still going to have single points of failure

in your network,

but it is cheaper than being fully hardware redundant.

Let's take a look at this concept for a moment.

Here you could see a single point of failure in my network.

Even though I have two switches and multiple connections

between those switches,

which gives me additional redundancy,

that router is not giving me additional redundancy.

It's a single point of failure

because it's the only router I have.

So, even if the router has internal hardware redundancy,

like two power supplies and two network cards,

I still only have one router chassis and one circuit board

running in that router.

So, if that router goes down,

this entire network is going to stop.

Therefore, this is considered a single point of failure.

Now instead,

I could redesign the network

and I can increase its redundancy by doing this.

Notice, I now have two PCs that want to talk to each other.

And each of them has dual network interface cards

talking to two different switches.

And each of those switches talks to two different routers.

Everything is connected to everything else

in a mesh topology for these network devices.

This gives me multiple connections between each device

and provides me with link redundancy, component redundancy,

and even inside those devices,

I may have two network cards, two power supplies,

and two of every other internal network component there is,

so that I have a very redundant

and highly available network.

Now, if one of those routers needs to be upgraded,

I can take it offline and update its firmware,

and then the entire time that second router

is still on the network,

maintaining the load and providing service to all the users.

Then I can put the first router back on the network,

take the second router offline and then do its upgrades.

By doing this and taking turns,

I still am able to have network functions run,

and I have no downtime associated with this network.

This is how you keep a network highly available.

Now, let's talk a little bit more about hardware redundancy.

Inside these routers and other network devices,

we can have hardware redundancy or the devices themselves

could be hardware redundant.

Now, if I have two routers and they're both

serving the same function,

this is considered hardware redundancy.

I could also have hardware redundancy in the components

by having two network cards or two hard drives

or two internal power supplies on a single device.

That way, if one of them fails, the second one takes over.

Usually, you're going to find this

in strategic network devices,

things like your switches, your routers, your firewalls,

and your servers,

because you can't afford a failure

in any one of those devices,

because they would take down large portions

of your network or its services.

On the other hand, if I'm considering my laptop,

I only have one hard drive in it.

If that laptop fails or that hard drive fails,

I would just deal with that downtime.

I might buy a new laptop or a new hard drive

and then restore from an old backup.

That would get me back up and running.

Now, when we're working with end-user devices

like workstations and clients,

we often don't deal with redundancy.

But when you start getting to servers and routers

and switches and firewalls,

you need to start having hardware

and component level redundancy

because these serve lots of end-users.

When we deal with this redundancy,

we can then cluster our devices and have them work in either

an active-active,

or an active-passive configuration.

All right.

Let's assume I have this one computer

and it has two network interface cards

that are connected to the network.

Do I want it to talk to both routers at the same time?

Well, if I'm active-active,

then both of those network interface cards

are going to be active at the same time,

and they each are going to have their own MAC address,

and they're going to be talking at the same time

to either of these two routers.

This can then be done to increase the bandwidth

of this computer and load balance

across both network interface cards.

This is known as Network Interface Card teaming,

or NIC teaming,

where a group of network interface cards,

is used for load balancing and failover for a server

or another device like that.

Now, on the other hand,

we can use active-passive,

and this is going to have a primary

and a backup network interface card.

Now, one of these cards is going to be active

and being used at all times.

And when it fails,

the other card is going to go from standby and take over.

In this case,

there is no performance increase from having two cards,

but I have true redundancy and failover capabilities.

In an active-passive configuration,

both NICs are going to be working together

and they're going to have a single MAC address

that they're going to display to the network,

so they look like they're a single device.
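
To visualize the difference, here's a hypothetical Python sketch (not any vendor's actual teaming implementation) of an active-passive NIC team: one card carries the traffic, the standby only takes over when the active card fails, and the team presents a single MAC address.

```python
# Hypothetical model of an active-passive NIC team presenting one MAC address.
class NicTeam:
    def __init__(self, primary: str, standby: str, mac: str):
        self.nics = {primary: True, standby: True}   # True = healthy
        self.active = primary
        self.standby = standby
        self.mac = mac                                # single MAC shown to the network

    def fail(self, nic: str):
        self.nics[nic] = False
        if nic == self.active and self.nics[self.standby]:
            # Failover: the standby card becomes the active one.
            self.active, self.standby = self.standby, self.active

    def send(self, frame: str) -> str:
        return f"{frame} sent via {self.active} (MAC {self.mac})"

team = NicTeam("eth0", "eth1", "AA:BB:CC:DD:EE:FF")
print(team.send("frame1"))   # goes out eth0
team.fail("eth0")
print(team.send("frame2"))   # failover: now goes out eth1
```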

Now, when you start looking at redundancy at layer three,

we're going to start talking about our routers.

Now here,

our clients are getting configured with a default gateway,

which is our router by default.

But, if the default gateway went down,

we wouldn't be able to leave the subnet,

and so we'd be stuck on the internal network.

Now, we don't want that.

So instead,

we want to add some redundancy

and we'll use layer three redundancy

using a virtual gateway.

To create a virtual gateway,

we need to use a First Hop Redundancy Protocol, or FHRP,

such as the Virtual Router Redundancy Protocol, VRRP.

Now, the most commonly used First Hop Redundancy Protocol

is known as HSRP or the Hot Standby Router Protocol.

This is a layer three redundancy protocol

that's used as a proprietary First Hop Redundancy Protocol

in Cisco devices.

HSRP is going to allow for an active and a standby router

to be used together.

And instead,

we get a virtual router that's defined

as our default gateway.

The client devices like the workstations and servers

are then going to be configured to use the virtual router

as its gateway.

When the PC communicates to the IP of the virtual router,

the router will determine which physical router is active

and which one is standby.

And then, it forwards the traffic to that active router.

If the active router goes down,

the standby router will pick up the responsibility

for that active router

until the other router comes back online

and takes over its job again.

Now, with VRRP,

the Virtual Router Redundancy Protocol,

this is one that was created

by the Internet Engineering Task Force.

It's an open standard variant

of the Hot Standby Router Protocol, or HSRP.

VRRP allows for one master or active router,

and the rest can then be added in a cluster as backups.

Unlike HSRP,

where you can only have one router as active

and one as standby,

VRRP is going to allow you to have multiple standby routers.

Just like HSRP,

you're going to configure the VRRP

to create a virtual router

that's going to be used as a default gateway

for all of your client devices.
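
As a rough mental model, here's a simplified Python sketch of the virtual gateway idea behind HSRP and VRRP; it is not the real protocol, just an illustration of clients pointing at one virtual IP while whichever router is healthy actually handles the traffic.

```python
# Simplified model of a first-hop redundancy setup: clients point at one
# virtual gateway IP; traffic is actually handled by whichever router is up.
class VirtualGateway:
    def __init__(self, vip: str, routers: list[str]):
        self.vip = vip                      # default gateway the clients use
        self.routers = routers              # priority order: active first, then standbys
        self.up = {r: True for r in routers}

    def forward(self, packet: str) -> str:
        for router in self.routers:         # first healthy router handles the packet
            if self.up[router]:
                return f"{packet} -> {self.vip} handled by {router}"
        return "no gateway available"

gw = VirtualGateway("10.0.0.1", ["RouterA", "RouterB", "RouterC"])  # VRRP-style: multiple standbys
print(gw.forward("packet1"))       # handled by RouterA
gw.up["RouterA"] = False           # the active router fails
print(gw.forward("packet2"))       # RouterB picks up the role
```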

Now, in order to provide load balancing on your networks

and to increase both redundancy

and performance of your networks,

you can use GLBP,

which is the Gateway Load Balancing Protocol,

or you can use LACP, the Link Aggregation Control Protocol.

Now, GLBP or the Gateway Load Balancing Protocol

is a Cisco protocol,

and it's another proprietary First Hop Redundancy Protocol.

Now, GLBP will allow us to create a virtual router

and that virtual router will have two routers

being placed behind it,

in active and standby configuration.

The virtual router or gateway will then forward traffic

to the active or standby router

based on which one has the lower current loading,

when the gateway receives that traffic.

If both can support the loading,

then the GLBP will send it to the active

since it's considered the primary device.

By using GLBP,

you can increase the speeds of your network

by using load balancing between two routers or gateways.
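
Here's a small illustrative Python sketch of the load-based choice described above for GLBP; the numbers and the decision rule are simplified assumptions, not the actual protocol logic.

```python
# Simplified illustration of the load-based forwarding choice the lesson describes for GLBP.
def choose_router(active_load: float, standby_load: float, capacity: float) -> str:
    """Return which router should take the next flow (loads and capacity in Mbps)."""
    if active_load <= capacity and standby_load <= capacity:
        # Both can support the loading: prefer the active (primary) router.
        return "active"
    # Otherwise send it to whichever currently has the lower load.
    return "active" if active_load < standby_load else "standby"

print(choose_router(active_load=200, standby_load=150, capacity=1000))   # active
print(choose_router(active_load=1200, standby_load=150, capacity=1000))  # standby
```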

Now, the second thing we can use,

is LACP or Link Aggregation Control Protocol.

This is a redundancy protocol that's used at layer two.

So we're going to be using this with switches.

LACP is going to achieve redundancy by having multiple links

between the network devices,

where load balancing over multiple links can occur.

The links are all going to be considered

part of a single combined link

even though we have multiple physical links.

This gives us higher speeds and increases our bandwidth.

For example,

let's pretend I have four Cat5 cables.

Each of these are connected to the same switch.

Now, each of those cables has 100 megabits per second

of bandwidth.

Now, if I use the Link Aggregation Control Protocol,

I can bind these all together and aggregate them

to give me 400 megabits per second

of combined bandwidth by creating

one single logical link

from those four connections.
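
As a quick sanity check of that math, here's a tiny Python sketch (ignoring protocol overhead, and the fact that a single flow normally rides just one member link).

```python
# Aggregate bandwidth of a link-aggregation group (illustrative arithmetic only).
member_links_mbps = [100, 100, 100, 100]   # four Cat5 links at 100 Mbps each
aggregate_mbps = sum(member_links_mbps)
print(aggregate_mbps)                      # 400 Mbps combined
```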

Now, let's consider what happens

when that traffic is trying to leave our default gateway

and get out to the internet.

Now in your home,

you probably only have one internet connection,

but for a company,

you may wish to have redundant paths.

For example, at my office,

we have three different internet connections.

The first is a microwave link that operates

at 215 megabits per second for uploads and downloads.

The second, is a cable modem connection.

It operates at 300 megabits per second for downloads

and 30 megabits per second for uploads.

Now the third is a cellular modem,

and that gives me about 100 megabits per second

for downloads and about 30 megabits per second for uploads.

Now, the reason I have multiple connections,

is to provide us with increased speed and redundancy.

So, to achieve this,

I take all three of these connections

and connect them to a single gateway

that's going to act as a load balancer.

If all the connections are up and running,

they're going to load balance my traffic

across all three of those connections

to give me the highest speeds at any given time.

But, if one of those connections drops,

the load balancer will remove it from the pool

until it can be returned to service.

By doing this,

I can get a maximum speed of about 615 megabits per second

for a combined download.

And on the upload,

I can get about 310 megabits per second,

when using all three connections

and they're all functioning and online.
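
Here's a small Python sketch of that pooling idea, using the download speeds from the example; it's an illustration of the load balancer's pool, not an actual product configuration.

```python
# Download speeds (Mbps) of the three connections from the example, and whether each is up.
links = {"microwave": 215, "cable": 300, "cellular": 100}
healthy = {"microwave": True, "cable": True, "cellular": True}

def pool_download(links, healthy) -> int:
    """Total download bandwidth across the links still in the load balancer's pool."""
    return sum(speed for name, speed in links.items() if healthy[name])

print(pool_download(links, healthy))   # 615 Mbps with all three connections up
healthy["cable"] = False               # one connection drops...
print(pool_download(links, healthy))   # ...and the pool falls back to 315 Mbps
```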

Similarly,

you might be in an area where you can get fiber connections

to your building.

Now, in those cases,

you may purchase a primary and a backup connection.

And if you do,

you should buy them from two different providers.

If both of your connections are coming from the same company

and they go down,

well, guess what?

You just lost both of your connections

because the upstream ISP might be down.

For this reason,

it's always important to have diversity in your path

when you're creating link redundancy,

just like I did in my office.

I have a microwave connection through one ISP.

I have a cable modem through another ISP,

and I have a cellular modem through a third ISP.

That way,

if any one goes down,

I still have two other paths I can use.

Now, the final type of redundancy that we need to discuss,

is known as multipathing.

Multipathing is used in our storage area networks.

Multipathing is used to create more than one physical path

between the server and its storage devices.

And this allows for better fault tolerance

and performance enhancements.

Basically, think of multipathing

as a form of link aggregation,

but instead of using it for switches,

we're going to use it for our storage area networks.

In the last lesson I showed you a couple of diagrams

of redundant networks,

but one of the things we have to think about in this lesson

is the considerations we have

when we start designing these redundant networks.

First you need to ask yourself,

are you going to use redundancy in the network,

and if so, where, and how?

So are you going to do it from a module or a parts perspective?

For instance, are you going to have multiple power supplies,

multiple network interface devices, multiple hard drives,

or are you going to look at it more from a chassis redundancy

and have two sets of routers or two sets of switches?

These are things you have to think about.

Which one of these are you going to use,

because each one is going to affect the cost

of your network, based on the decisions you make.

You have to be able to make a good business case

for which one you're going to use, and why.

For instance, if you could just have

a second network interface card or a second power supply,

that's going to be a lot cheaper

than having to have an entire switch

or an entire extra router there.

Now, each of those switches and routers,

some of these can cost $3,000, $4,000, or $5,000,

and so it might be a lot cheaper

to have a redundant power supply, right,

and so these are the things you have to think about

and weigh as you're building your networks.

Another thing you have to think about

is software redundancy,

and which features of those are going to be appropriate.

Sometimes you can solve a lot of these redundancy problems

by using software as opposed to hardware.

For example, if you have a virtual network setup,

you could just put in a virtual switch

or a virtual router in there,

and that way you don't have to bring

another real router or real switch in,

that can save you a lot of money.

There's also a lot of other software solutions out there,

like a software RAID,

that will give you additional redundancy

for your storage devices,

as opposed to putting in an extra hard drive chassis,

or another RAID array or storage area network.

Also, these are the types of things

you have to be thinking about

as you're building out your network, right?

When you think about your protocols,

what protocol characteristics

are going to affect your design requirements?

This is really important if you're designing things,

and you're using something like

TCP versus UDP in your designs,

because TCP has that additional redundancy

by resending packets, whereas UDP doesn't,

so this is something you have to consider as well.

As you design all these different things,

all of these different factors are going to work together,

just like gears, and each one turns another,

and each one is going to feed another one,

and you get more reliability and availability

in your networks

by adding all these components together.

In addition to all this,

there are other design considerations

that we have to think about as well,

like what redundancy features should we use

in terms of powering the infrastructure devices?

Are we going to have internal power supplies

and have two of those, and have them redundant?

Or, are we going to have battery backups, or UPSs,

are we going to have generators?

All of these things are things you have to think about,

and I don't have necessarily the right answers for you,

because it all comes down to a case-by-case basis.

Every network is going to be different,

and every one has its own needs

and its own business case associated with it.

The networks that I had at former employers

were serving hundreds of thousands of clients,

and those were vastly different than the ones

that are servicing my training company right now,

with just a handful of employees.

That's because when you're dealing with your network design

and your redundancies,

you have to think about the business case first.

Each one is going to be different

based on your needs and your considerations.

What redundancy features should be used

to maintain the environmental conditions of your space?

To have good power and space and cooling,

you need to make sure

that you're thinking about air conditioning.

Do you have one unit or two?

Do you have generators onsite?

Do you have additional thermal heating or thermal cooling?

All of these things are things you have to think about.

What do you do when power goes down?

What are some of those things

that you're going to have to deal with

if you're running a server farm

that has to have units running all the time,

because it can't afford to go down

because it's going to affect

thousands and thousands of people,

instead of just your one office with 20 people?

All of these are things you have to consider

as you think about it.

In my office, we made the decision

that one air conditioning unit was enough,

because if it goes down, we might just not work today

and we'll come to work tomorrow, we can get over that.

But in a server farm,

we need to make sure we have multiple air conditioners,

because if that goes down

it can actually burn up all the components, right?

So we have to have additional power and space and cooling

that are fully redundant,

because of that server infrastructure

that we're supporting there.

These are the things you have to balance in your practices.

And so when you start looking at the best practices,

I want you to examine your technical goals

and your operational goals.

Now what I mean by that is,

what is the function of this network?

What are you actually trying to accomplish?

Are you trying to get to 90% uptime, or 95%, or 99%,

or are you going for that gold standard

of five nines of availability?

Every company has a different technical goal,

and that technical goal is going to determine

the design of your network.

And you need to identify that

inside of your budgeting as well,

because funding these high-availability features

is really expensive.

As I said, if I want to put a second router in there,

that might cost me another 3,000 or $5,000.

In my own personal network,

we have a file server, and it's a small NAS device.

We weren't comfortable

having all of our file storage on a single hard drive,

and we built this NAS array instead,

so if one of those drives goes out,

we have three others that are carrying the load.

This is the idea here.

Now, eventually we decided we didn't need that NAS anymore,

and so we replaced that NAS enclosure with a full RAID 5.

Later on we took that full RAID 5

and we switched it over to a cloud server

that has redundant backups

in two different cloud environments.

And so all of these things work together

based on our decisions,

but as we moved up that scale

and got more and more redundancy,

we have more and more costs associated.

It was a lot cheaper just to have an 8-terabyte hard drive

with all of our files on it,

then we went to a NAS array

and that cost two or three times that money,

then we went to a full RAID 5

and that cost a couple more times that,

then we went to the cloud and we have to pay more for that.

Remember, all your decisions here

are going to cost you more money,

but if it's worth it to you, then that's what's important, right,

and so these are the things you have to balance

as you're designing these fully redundant networks,

based on those technical goals.

You also need to categorize

all of your business applications into profiles,

to help with this redundancy mission

that you're trying to go and accomplish here.

This will really help you as you start

going into the quality of service as well.

Now if I said, for instance,

that web is considered category one

and email is category two

and streaming video's going to be category three,

then we can apply profiles

and give certain levels of service

to each of those categories.

Now we'll talk specifically of how that works

when we talk about quality of service in a future lesson.

Another thing we want to do

is establish performance standards

for our high-availability networks.

What are the standards that we're going to have to have?

These standards are going to drive

how success is measured for us,

and in the case of my file server, for instance,

we measure success as it being up and available

when my video editors need to access it,

and that they don't lose data,

because if we lost all of our files,

that'd be bad for us, right?

Those are two metrics that we have,

and we have numbers associated with each of those things.

In other organizations, we measure it based on the uptime

of the entire end-to-end service,

so if a client can't get out to the internet for an ISP,

that would be a bad thing, that's one of their measurements.

Now the other one might be, what is their uptime?

All of these performance standards are developed

through metrics and key performance indicators.

If you're using something like ITIL

as your IT service management standards,

this is what you're going to be doing as you're trying

to run those inside your organization as well.

Finally, here we want to define how we manage and measure

the high-availability solutions for ourselves.

Metrics are going to be really useful to quantify success,

if you develop those metrics correctly.

Decision-makers and leaders love seeing metrics.

They love seeing charts and seeing the performance,

and how it's going up over time,

and how our availability is going up,

and how our costs are going down.

Those are all good things,

but if you don't know what you're measuring

or why you're measuring it,

which really goes back to your performance standards,

then these are the kind of things

where you're just wasting your time with metrics.

A lot of people measure a lot of things,

and they don't really tell you

if you're getting the outcome you're wanting.

I want to make sure that you think about

how you decide on what metrics you're going to use.

Now, we've covered a lot of different design criteria

in this lesson, but the real big takeaway here

that I want you to think about is this.

If you have an existing network,

you can add availability to it,

and you can add redundancy to it.

You can retrofit stuff in,

but it's going to cost you a lot more time

and a lot more money.

It is much, much cheaper

to design this stuff early in the process

when you start building a network from scratch.

So, if you're designing a network and you're asked early on

what kind of things you need,

I want you to think about all these things of redundancy

in your initial design.

Adding them in early is going to save you a lot of money.

Every project has three main factors,

time, cost, and quality,

and usually, one of these things is going to suffer

at the expense of the other two.

For example, if I asked you to build me a network

and I want it to be fully redundant

and available by tomorrow, could you do it?

Well, maybe, but it's probably going to cost me a lot of money,

and because I'm giving you very little time,

it's going to cost me even more,

or your quality is going to suffer.

So, you could do it good, you could do it quick,

or you could do it cheap, but you can't do all three.

It's always going to be a trade-off between these three things,

and I want you to remember

as you're out there and you're designing networks,

you need to make sure you're thinking about your redundancy

and your availability and your reliability,

because often that quality is going to suffer

in favor of getting things out quicker

or getting things out cheaper.

Recovery sites.

In this lesson, we're going to discuss the concept

of recovery sites.

After all, things are going to break, and your networks

are going to go down.

This is just a fact of life.

So what are you going to do when it comes time

to recover your enterprise network?

Well, that's what we're going to discuss in this lesson.

When it comes to designing redundant operations

for your company,

you really should consider a recovery site.

And with recovery sites, you have four options.

You see, you can have all the software and hardware

redundancy you want.

But at the end of the day,

sometimes you need to actually recover your site too.

Now this could be because there's a fire that breaks out

in your building or a hurricane or earthquake.

All of these things might require you to relocate

and if you do, you're going to have to choose

one of four options.

This could be a cold site, a warm site, a hot site

or a cloud site.

Now when we deal with cold sites,

this means that you have a building that's available

for you to use,

but you don't have any hardware or software in place.

And if you do have some, those things aren't even configured.

So you may have to go out to the store and buy routers

and switches and laptops and servers

and all that kind of stuff.

You're going to bring it to a new building, configure it

and then restore your network.

This means that while recovery is possible,

it's going to be slow and it's going to be time consuming.

If I have to build you out a new network in a cold site,

that means I'm going to need you to bring everything in

after the bad thing has already happened,

such as your building catching fire.

And this can take me weeks or even months

to get you fully back up and running.

Now, the biggest benefit of using a cold site

is that it is the cheapest option

that we're going to talk about.

The drawbacks are that it is slow and essentially

this is just going to be an empty building

that's waiting for you to move in and start rebuilding.

Now next, we have a warm site.

A warm site means you have the building available

and it already contains a lot of the equipment.

You might not have all your software installed

on these servers or maybe you don't have the latest security

patches or even the data backups from your other site

haven't been recovered here yet.

But you do already have the hardware

and the cabling in place.

With a warm site,

we already have a network that's running in the facility.

We have switches and routers and firewalls.

But we may not maintain it fully

each and every day of the year.

So, when a bad event happens

and you need to move into the warm site,

we can load up our configurations on our routers

and switches, install the operating systems on the servers,

restore the files from backup

and usually within a couple of days,

we can get you back up and running.

Normally with a warm site,

we're looking at a restoral time of between 24 hours

and seven days.

Basically, under a week.

Recovery here is going to be fairly quick,

but not everything from the original site

is going to be there and ready for all employees

at all times.

Now, if speed of recovery is really important to you,

the next type of site is your best choice.

It's known as a hot site.

Now, a hot site is my personal favorite.

But it's also the most expensive to operate.

With a hot site, you have a building, you have the equipment

and you have the data already on site.

That means everything in the hot site is up and running

all the time.

Ready for you to instantly switch over your operations

from your primary site to your hot site

at the flip of a switch.

This means you need to have the system and network

administrators working at that hot site every day

of the year, keeping it up and running, secured

and patched and ready for us to take over operations

whenever we're told to.

Basically, your people are going to walk out of the old site,

get in their car, drive to the new site, login

and they're back to work as if nothing ever happened.

This is great because there's very minimal downtime.

And you're going to have nearly identical levels of servers

at the main site and the hot site.

But as you can imagine, this costs a lot of money.

Because I have to pay for the building,

two sets of equipment, two sets of software licenses

and all the people to run all this stuff.

You're basically running two sites at all times.

Therefore, a hot site gets really expensive.

Now a hot site is very critical

if you're in a high availability type of situation.

Let's say you work for a credit card processing company.

And every minute they're down costs them millions of dollars.

They would want to have a hot site, right?

They don't want to be down for three or four weeks.

So they have to make sure they have their network up

and available at all times.

Same thing if you're working for the government

or the military,

they always need to make sure they're operating

because otherwise people could die.

And so they want to make sure that is always up and running.

That's where hot sites are used.

Now if you can get away from those types of criticality

requirements though, which most organizations can,

you're going to end up settling on something like a warm site,

because it's going to save you on the cost of running

that full recovery hot site.

Now the fourth type of site we have

is known as a cloud site.

Now a cloud site isn't exactly a full recovery site,

like a cold, warm, or hot site is.

In fact, there may be no building for you to move

your operations into.

Instead, a cloud site is a virtual recovery site

that allows you to create a recovery version

of your organization's network in the cloud.

Then if disaster strikes, you can shift all your employees

to telework operations by accessing that cloud site.

Or you can combine that cloud site with a cold or warm site.

This allows you to have a single set of system

administrators and network administrators

that run your day to day operational networks

and they can also run your backup cloud site.

Because they can operate it all

from wherever they're sitting in the world.

Now cloud sites are a good option to use,

but you are going to be paying a cloud service provider

for all the compute time, the storage

and the network access required to use that cloud site

before, during and after the disastrous event.

So, which of these four options should you consider?

Well, that really depends on your organization,

its recovery time objective, the RTO,

and its recovery point objective, the RPO.

Now the recovery time objective or RTO

is the duration of time and service level

within which a business process has to be restored

after a disaster happens in order to avoid unacceptable

consequences associated with a break in continuity.

In other words, our RTO is going to answer our question,

how much time did it take for the recovery to happen

after the notification of a business process disruption?

So, if you have a very low RTO,

then you're going to have to use either a hot site

or a cloud site because you need to get up and running

quickly.

That is the idea of a low RTO.

Now on the other hand, we have to think about our RPO.

Which is our recovery point objective.

Now RPO is going to be the interval of time that might pass

during the disruption before the quantity of data loss

during that period exceeds the business continuity plan's

maximum allowable threshold or tolerance.

Now RPO is going to determine the amount of data

that will be lost or will have to be re-entered

due to downtime during network operations.

It symbolizes the amount of data that can be acceptably lost

by the organization.

For example, in my company we have an RPO of 24 hours.

That means if all of our servers crashed right now,

I as the CEO have accepted the fact that I can lose no more

than the last 24 hours worth of data and that would be okay.

To achieve this RPO,

I have daily backups that are conducted every 24 hours.

So, we can ensure we always have our data backed up

and ready for restoral at any time.

And that means we will lose at most 24 hours worth of data.
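
To see how the backup interval bounds the data loss, here's a minimal Python sketch with hypothetical timestamps; the dates are made up purely for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical timestamps: daily backups at midnight, failure in the mid-afternoon.
last_backup = datetime(2024, 1, 10, 0, 0)
failure     = datetime(2024, 1, 10, 15, 30)

data_lost = failure - last_backup     # everything since the last backup is gone
rpo = timedelta(hours=24)             # the business tolerates up to 24 hours of loss

print(data_lost)                                              # 15:30:00 of data lost
print("within RPO" if data_lost <= rpo else "RPO violated")   # within RPO
```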

The RTO that recovery time objective is going to be focused

on the real time that passes during a disruption.

Like if you took out a stopwatch and started counting.

For example, can my business survive

if we're down for 24 hours?

Sure.

It would hurt, we would lose some money, but we can do it.

How about seven days?

Yeah, again, we would lose some money,

we'd have some really angry students,

but we could still survive.

Now, what about 30 days?

No way.

Within 30 days all of my customers and students,

they would have left me.

They would take their certifications

through some other provider out there

and I would be out of business.

So I had to figure out that my RTO is someplace between one

and seven days to make me happy.

So that's the idea of operational risk tolerance,

we start thinking about this from an organizational level.

How much downtime are you willing to accept?

Based on my ability to accept seven days,

I could use a warm site instead of a hot site.

But if I could only accept 24 hours of downtime

or five minutes of downtime,

then I would have to use a hot site instead.

RTO is used to designate that amount of real time

that passes on the clock before that disruption

begins to have serious and unacceptable impediments

to the flow of our normal business operations.

That is the whole concept here with RTO.

Now when we start talking about RPO and RTO,

you're going to see this talked about a lot in backups

and recovery as well.

When you deal with backups and recovery,

you have a few different types of backups.

We have things like full backups, incremental backups,

differential backups and snapshots.

Now a full backup is just what it sounds like.

It's a complete backup of every single file on a machine.

It is the safest and most comprehensive backup method,

but it's also the most time consuming and costly.

It's going to take up the most disk space

and the most time to run.

This is normally going to be run on your servers.

Now another type of backup we have

is known as an incremental backup.

With an incremental backup, I'm going to back up the data

that changed since the last backup.

So, if I did a full backup on Sunday

and I go to do an incremental backup on Monday,

I'm only going to back up the things that have changed

since doing that full backup on Sunday.

Now another type we have is known as a differential backup.

A differential backup is only going to back up the data

since the last full backup.

So, let's go back to my example

of Sunday being a full backup

and then I did an incremental backup on Monday.

Then that backup is going to copy everything since Sunday.

But if I do an incremental on Tuesday, it's only going to do

the difference between Monday and Tuesday.

because Monday was the last backup before that incremental.

When I do it Wednesday,

I'm going to get from Tuesday to Wednesday.

And so when I do these incrementals,

I now have a bunch of smaller pieces

that I have to put back together when I want to restore my servers.

Now a differential, on the other hand, is going to be

the entire difference since the last full backup.

So if on Wednesday I did a differential backup,

I'm going to have all the data that's different from Sunday,

the last full backup all the way up through Wednesday.

This is the difference between the differential

and an incremental.

So if I do a full backup on Sunday,

and then on Monday I do both an incremental and a differential,

they're going to look the exact same.

But on Tuesday the incremental is only going to include

the stuff since Monday.

But the differential will include everything since Sunday.

This includes all of Monday's and Tuesday's changes.

And so you can see how this differential is going to grow

throughout the week until I do another full backup

on the next Sunday.

Now if I do an incremental, it's only that last 24-hour period.
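
Here's a short Python sketch, using made-up file names, that contrasts what a Wednesday incremental captures versus a Wednesday differential after a Sunday full backup.

```python
# Changes made each day since Sunday's full backup (hypothetical file sets).
changes = {
    "Mon": {"a.txt"},
    "Tue": {"b.txt"},
    "Wed": {"c.txt"},
}

# Incremental on Wednesday: only what changed since the last backup of any kind (Tuesday's).
incremental_wed = changes["Wed"]

# Differential on Wednesday: everything changed since the last FULL backup (Sunday's).
differential_wed = changes["Mon"] | changes["Tue"] | changes["Wed"]

print(incremental_wed)    # {'c.txt'}
print(differential_wed)   # {'a.txt', 'b.txt', 'c.txt'}

# Restoring from incrementals needs the full backup plus every incremental since;
# restoring from a differential needs just the full backup plus the latest differential.
```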

Now the last type of backup we have is known as a snapshot.

Now if you're using virtualization

and you're using virtual machines,

this becomes a read only copy of your data frozen in time.

For example, I use snapshots a lot when I'm using virtual

machines or I'm doing malware analysis.

I can take a snapshot on my machine,

which is a frozen instant in time.

And then I can load the malware and all the bad things

I need to do.

And then once I'm done doing that,

I can restore back to that snapshot which was clean

before I installed all the malware.

This allows me to do dynamic analysis of it.

Now if you have a very large SAN array,

or storage area network array,

you can take snapshots of your servers

and your virtual machines in a very quick and easy way

and then you'll be able to restore them exactly back

to the way they were at any given moment in time.

Now when we use full, incremental and differential,

most of the time those are going to be used with tape backups

and offsite storage.

But if you're going to be doing snapshots,

that's usually done to a disk, like a storage area network array.

Now, in addition to conducting your backups of your servers,

it's also important to conduct backups

of your network devices.

This includes their state and their configurations.

The state of a network device contains all the configuration

and dynamic information from a network device

at any given time.

If you export the state of a network device,

it can later be restored to the exact same device

or another device of the same model.

Similarly, you can backup just the configuration information

by conducting a backup of the network device configuration.

This can be done using the command line interface

on the device or using third-party tools.

For example, one organization I worked for

had thousands of network devices.

So we didn't want to go around and do a weekly configuration

backup for all those devices individually.

Instead, we configured them to do that using the tool

known as SolarWinds.

Now once a week, the SolarWinds tool would back up

all the configurations and store them

on a centralized server.

This way, if we ever had a network device that failed,

we could quickly install a spare from our inventory,

restore the configurations from SolarWinds

back to that device and we will be back online

in just a couple of minutes.
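
As a rough illustration of that centralized configuration backup idea (not how SolarWinds itself works), here's a minimal Python sketch; fetch_running_config is a hypothetical placeholder for whatever mechanism actually pulls the configuration from each device.

```python
from datetime import datetime
from pathlib import Path

def fetch_running_config(hostname: str) -> str:
    """Hypothetical placeholder: in practice, a tool like SolarWinds or an SSH
    library would log in to the device and capture its running configuration."""
    return f"! running-config for {hostname} (placeholder)\n"

def backup_configs(devices: list[str], backup_root: str) -> None:
    """Store a timestamped copy of each device's configuration on a central server."""
    stamp = datetime.now().strftime("%Y%m%d-%H%M")
    for hostname in devices:
        config = fetch_running_config(hostname)
        path = Path(backup_root) / hostname / f"{stamp}.cfg"
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(config)

backup_configs(["core-sw1", "edge-rtr1"], "/srv/config-backups")
```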

Facilities support.

In this lesson, we're going to discuss the concept

of facilities and infrastructure support

for our data centers and our recovery sites.

To provide proper facility support,

it's important to consider power, cooling,

and fire suppression.

So we're going to cover uninterruptible power supplies,

power distribution units, generators, HVAC,

and fire suppression systems in this lesson.

First, we have a UPS, or uninterruptible power supply.

Now an uninterruptible power supply,

or uninterruptible power source,

is an electrical apparatus

that provides emergency power to a load

whenever the input power source or main power

is going to fail.

Most people think of these as battery backups,

but in our data centers and telecommunication closets,

we usually see devices

that contain more than just a battery backup.

For our purposes, we're going to use a UPS

that is going to provide line conditioning

and protect us from surges and spikes in power.

Our goal in using a UPS

is to make sure that we have clean, reliable power.

Now a UPS is great for short duration power outages,

but they usually don't last more than about 15 to 30 minutes

because they have a relatively short battery life.

The good news is the batteries

are getting better and better every day

and their lives are getting longer and longer

in newer units.

Second, we have power distribution units or PDUs.

Now a power distribution unit

is a device fitted with multiple outputs

designed to distribute electrical power,

especially to racks of computers

and networking equipment located within our data centers.

PDUs can be rack-mounted

or they can take the form of a large cabinet.

In a large data center,

you're usually going to see these large cabinets,

and in general,

there's going to be one PDU for each row of servers

and it maintains the high current circuits,

circuit breakers,

and power monitoring panels inside of them.

These PDUs can provide power protection from surges,

spikes, and brownouts,

but they are not designed

to provide full blackout protection like a UPS would

because they don't have battery backups.

Generally, a PDU will be combined with a UPS or a generator

to provide that power that is needed during a blackout.

Third, we have generators.

Now large generators are usually going to be installed

outside of a data center

in order to provide us with longterm power

during a power outage inside your region.

These generators can be powered by diesel,

gasoline, or propane.

For example, at my office,

I have a 20,000-watt diesel generator

that's used to provide power in case we have a power outage.

Now the big challenge with a generator though,

is that they take time to get up to speed

until they're ready to start providing power

to your devices.

They usually take between 45 and 90 seconds.

So you usually need to pair them up

with a battery backup or UPS

as you're designing your power redundancy solution.

For example, at my office, if the power goes out,

the UPS will carry the load for up to 15 minutes.

During that time,

the generator will automatically be brought online,

usually taking 45 to 90 seconds.

Once that generator is fully online,

and providing the right stable power,

and it's ready to take the load,

the power gets shifted

from the UPS batteries to the generator,

using an automatic transfer switch or ATS.

Now once the power is restored in our area

and has been steady for at least five minutes,

then our ATS will actually shift power back to the grid

through our UPS unit, that battery backup,

and then shut down our generator.
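
Here's a tiny Python sketch of the timing check implied by that design: the UPS runtime has to comfortably cover the generator's start-up window. The numbers come from the example above.

```python
# Sanity check from the example: the UPS must hold the load at least as long
# as the generator takes to come online (all times in seconds).
ups_runtime = 15 * 60            # battery carries the load for ~15 minutes
generator_startup = 90           # worst case from the example: 45 to 90 seconds

margin = ups_runtime - generator_startup
print(f"Coverage margin before transfer: {margin} seconds")   # 810 seconds
assert margin > 0, "UPS would be exhausted before the generator is ready"
```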

Fourth, we have HVAC units.

HVAC stands for heating, ventilation, and air conditioning.

Our data centers

are going to generate a ton of heat inside of them

because of all these servers, and switches,

and routers, and firewalls,

that are doing processing inside of them.

To cool down these devices,

we need to have a good HVAC system.

Now to help with this cooling,

most data centers are going to utilize

a hot and cold aisle concept.

Now in the simplest form,

each row of servers is going to face another row of servers.

These two rows

will have the front of the servers facing each other

and the rear of the servers facing away from the aisle.

This is because the servers are designed

to push air out the rear of the device.

So the front of the servers is in the cold aisle

and the rear of the servers is in the hot aisle.

This lets us focus our HVAC systems into the hot aisles

to suck that hot air out,

cool it down, and return it back to the cold aisle,

where it can then be circulated over the servers once again.

Remember, proper cooling is important to the health

and security of our networks and our devices.

If the network devices start to overheat,

they will shut themselves down

to protect their critical components,

and if those components get overheated for too long,

permanent damage can occur

or it can decrease the life expectancy of those devices.

Now our fifth and final thing we need to discuss

is fire suppression.

In a data center,

we usually have built-in fire suppression systems.

These can include wet pipe sprinklers,

pre-action sprinklers, and special suppression systems.

Now a wet pipe system is the most basic type

of fire suppression system,

and it involves a sprinkler system and pipes

that always contain water in those pipes.

Now in a server room or data center environment,

this is kind of dangerous

because a leak in that pipe could damage your servers

that are sitting underneath them.

In general, you should avoid using a wet pipe system

in and around your data centers.

Instead, you should use a pre-action system

to minimize the risk of accidental release

if you're going to be using a wet pipe system.

With a pre-action system,

both a detector actuator,

which works like a smoke detector,

and a sprinkler head have to be tripped

before the water is going to be released.

Again, using water in a data center,

even in a pre-action system,

is not really a good idea though, so I try to avoid it.

Instead, I like to rely on special suppression systems

for most of my data centers.

This will use something like a clean agent system.

Now a clean agent is something like a halocarbon agent

or an inert gas,

which, when released, will displace the oxygen

in the room

and essentially suffocate the fire.

Now, the danger with using

a special suppressant system like this

is that if there's people working in your data center,

those people can suffocate

if the clean agent is being released.

So your data center needs to be equipped with an alarm

that announces when the clean agent is being released,

and you also need to make sure

there's supplemental oxygen masks available

and easily accessible

by any person who's working in that data center

whenever they hear the alarm go off

for that clean agent release.

So remember, when you're designing your data centers

and your primary work environment or your recovery sites,

you need to consider your power,

your cooling, and your fire suppression needs.

Why do we need quality of service or QoS?

Well, nowadays we operate converged networks,

which means all of our networks are carrying voice, data

and video content over the same wire.

We don't have them all separated out like we used to.

We used to have networks for phones and ones for data

and ones for video,

but now everything's riding over the same IP networks.

So, because of this convergence of mediums,

we need these networks

to have a high level of availability

to ensure proper delivery

over all of these different mediums,

because we want a phone to work

every time we pick up the phone, right?

Well, by using QoS, we can optimize our network

to efficiently utilize all the bandwidth at the right time

to deliver the right service to our users

and give us success and cost savings.

Now, we want to have an excellent quality of service,

an excellent service for our customers,

and that's what we're going to start doing by using QoS.

So what exactly is QoS?

Well, quality of service enables us

to strategically optimize our network performance

based on different types of traffic.

Previously, we talked about the fact

that we want to categorize our different traffic types.

I might have web traffic and voice traffic and video traffic

and email traffic.

And by categorizing it

and identifying these different types of traffic,

I can then prioritize that traffic and route it differently.

So I might determine how much bandwidth is required

for each of those types of traffic.

And I can efficiently use my wide area network links

and all that bandwidth available, for maximum utilization,

and save me bandwidth costs over time.

This can help me identify

the types of traffic that I should drop

whenever there's going to be some kind of congestion,

because if you look at the average load,

there's always going to be some peaks and some valleys.

And so we want to be able to figure that out.

So for example, here on the screen,

you can see the peaks and the valleys over time,

and we need to be able to categorize things

to fit within our bandwidth limitations.

So for example, if we have things like VoIP,

or voice over IP, or video service,

they need to have a higher priority,

because if I'm talking to you on a phone,

I don't want a high amount of latency.

For checking my bank balance, for instance, though,

I can wait another half a second for the web page to load.

But when listening to you talk, that half a second delay

starts sounding like an echo,

and it gives me a horrible service level.

So we want to be able to solve that,

and to do that, we use quality of service.

Now there are different categories of quality of service.

There are three big ones known as delay, jitter and drops.

When I talk about delay,

this happens when you look at the time

that a packet travels from the source to the destination,

this is measured in milliseconds,

and it's not a big deal if you're dealing with data traffic,

but if you're dealing with voice or video,

delay is an especially big thing,

especially if you're doing things live,

like talking on the phone or doing a live stream,

or something like that.

Now, jitter is an uneven arrival of packets,

and this is especially bad in Voice over IP traffic,

because you're using something like UDP.

And so if I say something to you, like, "my name is Jason,"

and you get "Jason my name is,"

it sounds kind of weird, right?

Now, usually it's not big chunks like that,

but instead it's little bits

and you'll hear these glick and glock sounds

that make it jumble up because of that jitter.

And this really sounds bad, and it's a bad user experience

if you're using Voice over IP.

And so jitter is a really bad thing

when you're dealing with voice and video.

Now, the third thing we have is what's known as a drop.

Drops are going to occur during network congestion.

When the network becomes too congested,

the router simply can't keep up with demand,

and the queue starts overflowing,

and it'll start dropping packets.

This is the way it deals with packet loss,

and if you're using TCP, it'll just send it again.

But again, if I'm dealing with VoIP, VoIP is usually UDP.

And so if we're talking

and all of a sudden my voice cuts out like that,

that would be bad too.

That's why we don't want to have packet drop on a VoIP call.

And so we want to make sure that that doesn't happen.

These network drops are something that can be avoided

by doing the proper quality of service as well.

So when we deal with this,

we have to think about effective bandwidth.

What is our effective bandwidth?

This is an important concept.

So let's look at this client and this server.

There's probably a lot more to this network

than what I'm showing you here on the screen,

but I've simplified it down for this example.

Here, you can see I have my client on the left,

and he wants to talk to the server.

So he goes up through the switch,

which uses 100 megabit per second Cat-5 cable.

Then he goes through a WAN link

over a 256 kilobit per second connection

because he's using an old DSL line.

Then that connects from that ISP over a T1 connection

to another router.

That router connects to an E1 connection to another router.

And from that router, it goes down a WAN link

over a 512 kilobit per second connection,

and then down to a switch with a gigabit connection,

down to the server.

Now, what is my effective bandwidth?

Well, it's 256 kilobits per second,

because no matter how fast any of the other links are,

whatever the lowest link is inside of this connection,

that is going to be your effective bandwidth.
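
To make that concrete, here's a minimal Python sketch with the link speeds from this example hard-coded as illustrative values; it just shows that the effective bandwidth of a path is whatever its slowest link is.

```python
# Minimal sketch: the effective bandwidth of a path is limited by its slowest link.
# Link speeds below mirror the example path (values in kilobits per second).
path_links_kbps = {
    "client_to_switch (Cat5)": 100_000,       # 100 Mbps
    "switch_to_wan (DSL)": 256,               # 256 Kbps
    "isp_t1": 1_544,                          # T1 is about 1.544 Mbps
    "isp_e1": 2_048,                          # E1 is about 2.048 Mbps
    "wan_to_switch": 512,                     # 512 Kbps
    "switch_to_server (Gigabit)": 1_000_000,  # 1 Gbps
}

bottleneck = min(path_links_kbps, key=path_links_kbps.get)
print(f"Effective bandwidth: {path_links_kbps[bottleneck]} Kbps "
      f"(limited by {bottleneck})")
# -> Effective bandwidth: 256 Kbps (limited by switch_to_wan (DSL))
```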

So we talked about quality of service categories,

in our next lesson, we're going to be talking about

how we can alleviate this problem

of this effective bandwidth, and try to get more out of it,

because we need to be able

to increase our available bandwidth, but in this example,

we're limited to 256 kilobits,

which is going to be really, really slow for us.

Now, I like to think about effective bandwidth

like water flowing through pipes.

I can have big pipes and I can have little pipes.

And if I have little pipes,

I'm going to get less water per second through it

than if I have a really big pipe.

And so this is the idea, if you think about a big funnel,

it can start to back up on us, right?

That's the concept,

and we have to figure out how we can fix that

by using quality of service effectively,

which we're going to discuss more in the next video.

When we deal with the quality of service categorization,

we first have to ask,

what is the purpose of quality of service?

Now, the purpose of quality of service is all about

categorizing your traffic and putting it into buckets

so we can apply a policy to certain buckets

based on those traffic categories

and then we can prioritize them based on that.

I like to tell stories and use analogies in my classes

to help drive home points.

And so, since we're talking about

quality of service and traffic,

I think it's important to talk about real-world traffic.

I live in the Baltimore, Washington D.C area.

This area is known for having

some really really bad traffic.

Now, to alleviate this they applied the idea

of quality of service to their traffic system.

They have three different categories of cars.

They have the first car, which is general public.

Anybody who gets in the road and starts driving,

they are part of this group.

Then there's another category

called high occupancy vehicles or HOV.

And so, if I'm driving my car

and I have at least two other passengers with me,

I can get into special HOV only lanes

and I can go a little bit faster.

Now the third bucket is toll roads or pay roads.

And you have to pay to get on these roads.

And based on the time of day

and the amount of traffic there is,

they actually increase or decrease the price.

Now, if it's during rush hour, you might pay 5 or $10

to get in one of those special toll lanes.

But, they're going to travel a whole lot faster

than the regular general commuter lanes or those HOV lanes.

Now, what does this mean in terms of quality of service?

Well, it's really the same thing.

We take our traffic and we go, okay, this is web traffic,

and this is email traffic,

and this is voice or video traffic.

And based on those buckets we assign a priority to them.

And we let certain traffic go first

and we let it get there faster.

Now, when we categorize this traffic

we start to determine our network performance based on it.

We can start figuring out the requirements

based on the different traffic types

and whether it's voice or video or data.

If we're starting to deal with voice or video

because there are things like streaming media

especially in real-time like a Skype call

or a Voice over IP service,

I want to have a very low delay

and therefore a higher priority.

This way I can do this stuff

for streaming media and voice services

and prevent those jitters and drops and things like that

that we talked about before.

Now, this is something that I want to make sure

has a good high priority so I can get it through.

Instead if I have something with a low priority

that might be something like web browsing

or non-mission critical data.

For instance, if my employees are surfing on Facebook,

that would be a very low priority.

Or if I deal with email,

email is generally a low priority

when it comes to quality of service.

Now why is that, isn't email important to you?

Well, because most email is done

as a store and forward communication method.

This means when I send email,

it can sit on my server for 5 or 10 minutes

before it's actually sent out to the end-user

and they'll never realize it.

So that's okay.

It can be a low priority, it'll still get there eventually.

But if I did the same thing with VoIP traffic,

even delaying it by half a second or a second,

you're going to hear jitters and bumps and echoes

and that would be a horrible service.

So, we want to make sure you get high quality of service

for VoIP and lower priority for email.

Now that's just the way we have it set up.

You can have it set up however you want;

as long as you understand

what your quality of service policy is,

and your users understand it too,

this is going to be okay.

The best way to do that is to document it

and share that with your users.

You want to make sure your users understand your policy

because this will help make sure

that they don't have problems

and start reporting that back to your service desk.

You can do this by posting it to your internal website.

You might post it as part of your indoctrination paperwork

or whatever method you want.

You want to make sure those users understand it

because they're the ones who are going to be there

surfing Facebook or watching YouTube.

If you've categorized it as a low priority,

they're going to think something's broken.

But if they know it's a low priority,

they understand it's not broken

it's just your corporate policy.

Now, if they're going to be surfing

something on the web that's mission critical,

that's a higher priority and it's going to get

preferential treatment with your quality of service,

they should know that too.

This is the idea here.

We have to make sure that they understand

how we categorize our traffic

and what categories those get put into.

Now, what are some ways that we can categorize our traffic?

Well, there's really three different mechanisms you can use.

We have best effort, integrated services,

and differentiated services.

Now, when we use best effort

this is when we don't have any quality of service at all

and so traffic is just first in, first out,

every man for himself.

We're going to do our best and just try to get it there.

There's really no reordering of packets.

There's no shaping.

It's just pretty much no quality of service.

First in, first out, best effort.

The second type is known as integrated services or IntServ.

This is also known as hard QoS.

There are different names for it

depending on what company you're using

and what routers and switches you're using.

But the idea here is,

we're going to make strict bandwidth reservations.

We might say that all web traffic

is going to get 50% of our bandwidth,

VoIP service is going to get 25%,

and video service is going to get the remaining 25%.

Now, by reserving bandwidth

for each of these services,

we now decide how much is going to be there

for each of those three categories.

Now, when we do a DiffServ or differentiated services,

also known as soft QoS,

those percentages become more of a suggestion.

There's going to be this differentiation

between different data types

but for each of these packets,

it's going to be marked its own way.

The routers and switches can then make decisions

based on those markings

and they can fluctuate traffic a little bit as they need to.

Now, this is referred to as soft QoS

because even though we set web up as maybe 50%,

if there's not as much web browsing going on right now,

we can actually take away some of that 50%

and give it over to VoIP and increase that from 25% to 35%.

This way, when somebody wants to browse the web,

we can then take back that extra from VoIP

and give it back to web, back to that 50% it originally had,

based on those markings and based on those categories.

Now, if we were using hard QoS or that integrated services,

even if we allocate 50% for web browsing

and nobody's using web browsing,

we're still going to have 50% sitting there

waiting to serve people for web browsing.

And that's why a lot of companies prefer to use soft QoS.

Now, let's take a look at it like this

because I like to use simple charts and graphs

to try to make it easy to understand.

With best effort at the top,

you have no strict policies at all.

And basically, you just make your best effort

at providing everyone a good quality of service.

Now with DiffServ you have less strict policies,

also known as soft QoS.

Now it's better than the best effort approach

but it's still not the most efficient

or effective method of providing a good quality of service

to those who really need it.

Now with IntServ approaches

you're going to have more of a hard QoS limit.

This is what we've talked about before.

Now, this is going to give you the highest level of service

to those within strict policies.

And if you need a really strong quality of service level

then IntServ or hard QoS with its strict policies

can really ensure that you get it.

Now, the way I like to look at this

is as bundles of QoS options that we can choose from.

So which of these bundles is really the best?

Well, it depends.

It depends on your network and it depends on your needs.

But most of the time, it's not going to be a best effort

because that's usually going to give you

not as much quality as you're really going to want here.

Now, when we start categorizing our traffic out there

we're going to start using these different mechanisms,

either soft or hard QoS, for doing that.

And we can do that using classification and marking.

We can do it through congestion management

and congestion avoidance.

We can use policing and shaping.

And we can also use link efficiency.

All of these choices fall under a soft QoS or hard QoS

depending on your configuration that you've set up

within your network appliances, firewalls, or routers.

As I mentioned before,

we have different ways of categorizing our traffic.

We can do it through classification, marking,

utilizing congestion management, congestion avoidance,

policing and shaping, and link efficiency.

All of these ways, are ways for us to help implement

our quality of service and take us from this to this.

Now, as you can see,

we want to start shaping out those peaks and valleys

using these different mechanisms

to give us a better quality of service.

Now, when we look at the classification of traffic,

traffic is going to be placed

into these different categories.

Now, this is going to be done

based on the type of traffic that it is.

There's email, but even inside of email,

we have many different classes

of information inside of an email.

If you think about email,

we have POP3 traffic, we have IMAP traffic.

We have SMTP traffic. We have Exchange traffic.

Those are four different types right there.

And so we can look at the headers

and we can look at the packet type of information

and we can even use the ports that are being used.

And then we can determine what services

need higher or less priority.

We can then do this, not just across email,

but across all of our traffic.

And by doing this, this classification

doesn't alter any bits in the frame itself or the packet.

Instead, there is no marking inside of there.

It's all based on the analysis of the packet itself,

the ports and the protocols used,

and our switches and routers are going to implement QoS

based on that information.

Now, another way to do this, is by marking that traffic.

With this, we're going to alter the bits within the frame.

Now we can do this inside frames, cells, or packets,

depending on what networks we're using.

And this will indicate how we handle this piece of traffic.

Our network tools are going to make decisions

based on those markings.

If you look at the type of service header,

it's going to have a byte of information or eight bits.

The first three of those eight bits are the IP Precedence,

and the first six bits together make up

the Differentiated Services Code Point, or DSCP.

Now you don't need to memorize

how this type of service is done inside the header.

But I do want you to remember one of the ways

that we can do this quality of service

is by marking and altering that traffic.
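
If you want to see how those markings sit inside that byte, here's a small Python sketch. The sample value 0xB8 is just an illustration (it happens to correspond to a marking commonly used for voice); the point is that the first three bits give the IP Precedence and the first six bits give the DSCP.

```python
# Minimal sketch: pulling IP Precedence and DSCP out of the IPv4 ToS/DS byte.
# 0xB8 is a sample value used here for illustration (it maps to DSCP 46,
# the "Expedited Forwarding" marking often applied to voice traffic).
tos_byte = 0xB8

ip_precedence = tos_byte >> 5   # first 3 bits of the byte
dscp = tos_byte >> 2            # first 6 bits (Differentiated Services Code Point)

print(f"ToS byte:      {tos_byte:08b}")
print(f"IP Precedence: {ip_precedence}")   # -> 5
print(f"DSCP:          {dscp}")            # -> 46
```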

Next, we have congestion management.

And when a device receives traffic

faster than it can be transmitted,

it's going to end up buffering that extra traffic

until bandwidth becomes available.

This is known as queuing.

The queuing algorithm is going to empty the packets

in a specified sequence and manner.

These algorithms are going to use one of three mechanisms.

There is a weighted fair queuing.

There's a low-latency queuing,

or there is a weighted round-robin.

Now let's look at this example I have here.

I have four categories of traffic:

Traffic 1, 2, 3, and 4.

It really doesn't matter what kind of traffic it is,

for our example right now,

we just need to know that there's four categories.

Now, if we're going to be using a weighted fair queuing,

how are we going to start emptying these piles of traffic?

Well, I'm going to take one from 1, one from 2,

one from 3, and one from 4.

Then I'm going to go back to 1 and 2 and 3 and 4.

And we'll just keep taking turns.

Now, is that a good mechanism?

Well, maybe. It depends on what your traffic is.

If column 1, for example, was representing VoIP traffic,

this actually isn't a very good mechanism,

because it forces us to keep waiting for our turn.

So instead, let's look at this low-latency queuing instead.

Based on our categories of 1, 2, 3, and 4,

we're going to assign priorities to them.

If 1 was a higher priority than 2,

then all of 1 would get emptied,

then all of 2 would get emptied,

and then all 3 and then all of 4.

Now this works well to prioritize things like

voice and video.

But if you're sitting in category 3 or 4,

you might start really receiving

a lot of timeouts and dropped packets

because it's never going to be your turn.

And you're just going to wait and wait and wait.

Now the next one we have is called the weighted round-robin.

And this is actually one of my favorites.

This is kind of a hybrid between the other two.

Now with a weighted round-robin,

we might say that category 1 is VoIP,

and category 2 is video, category 3 is web,

and category 4 is email.

And so we might say that in the priority order,

1 is going to be highest

and we're going to use a weighted round-robin,

and we might say, we're going to take three

out of category 1, two out of category 2,

and then one out of 3 and one out of 4.

And we'll keep going around that way.

We'll take three, two, one, one, three, two, one, one.

And we keep going.

That way, VoIP traffic is getting a lot of priority.

Video is getting the second highest priority.

And then we start looking at web and email

at the bottom of the barrel,

but they're still getting a turn

every couple of rounds here.

And so that way it becomes a weighted round-robin.

As I said, this is the quality of service mechanism

that I really like to implement inside my own networks.
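
Here's a minimal Python sketch of that weighted round-robin idea, using the made-up 3-2-1-1 weights from the example for VoIP, video, web, and email queues.

```python
from collections import deque

# Minimal sketch of weighted round-robin queue servicing, using the
# illustrative 3-2-1-1 weights from the example (VoIP, video, web, email).
queues = {
    "voip":  deque(f"voip-{i}"  for i in range(6)),
    "video": deque(f"video-{i}" for i in range(4)),
    "web":   deque(f"web-{i}"   for i in range(4)),
    "email": deque(f"email-{i}" for i in range(4)),
}
weights = {"voip": 3, "video": 2, "web": 1, "email": 1}

sent = []
while any(queues.values()):
    for name, weight in weights.items():   # one full round of the queues
        for _ in range(weight):
            if queues[name]:
                sent.append(queues[name].popleft())

print(sent)  # VoIP gets 3 slots per round, video 2, web and email 1 each
```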

Next, we have the idea of congestion avoidance.

As new packets keep arriving, they can be discarded

if the output queue is already filled up.

Now, I like to think about this as a bucket.

As you can see here, I have a cylinder on the bottom

and it has a minimum and a maximum.

Now, if it's already at maximum and you try

to put more into the bucket,

it just overflows over the top.

Now to help prevent this, we have what's called

the RED or random early detection.

This is used to prevent this overflow from happening for us.

As the queue starts approaching that maximum,

we have this possibility

that discard is going to happen.

And so what we start doing is dropping traffic.

Instead of just dropping traffic randomly,

we're going to drop it based on priority,

with the lowest traffic priority getting dropped first.

RED is going to drop packets from the selected queues

based on their defined limits.

Now I might start dropping TCP traffic first

because I know it'll retransmit itself.

Whereas with UDP, if you drop it, it's gone forever.

And so I might keep that in my queue a little bit longer,

so it doesn't get dropped.

Now, that's the idea here with TCP traffic,

even if I drop it, we're going to get that retransmission

and we'll try again.

But with UDP, if it dropped,

you're never going to know about it,

and you're going to have loss of service.

Now, when you're dealing with congestion avoidance,

we're going to try to use the buffer

to our advantage, and be able to use it to help us

get more bandwidth through.
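
As a rough sketch of how random early detection behaves, here's some Python where the chance of dropping a newly arriving packet ramps up as the queue fills between a minimum and a maximum threshold. The thresholds and maximum drop probability are made-up illustration values, not settings from any particular router.

```python
import random

# Minimal sketch of Random Early Detection (RED): instead of waiting for the
# queue to overflow, the drop probability rises as the queue depth grows
# between a minimum and maximum threshold.
MIN_THRESH, MAX_THRESH, MAX_DROP_PROB = 20, 80, 0.10   # illustrative values

def should_drop(queue_depth: int) -> bool:
    if queue_depth <= MIN_THRESH:
        return False                 # plenty of room, never drop early
    if queue_depth >= MAX_THRESH:
        return True                  # queue is full, everything gets dropped
    ramp = (queue_depth - MIN_THRESH) / (MAX_THRESH - MIN_THRESH)
    return random.random() < ramp * MAX_DROP_PROB

for depth in (10, 50, 79, 90):
    print(depth, should_drop(depth))
```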

Now, when we start putting all these things together,

we start getting into these two concepts,

known as policing and shaping.

Policing is going to discard packets

that exceed the configured rate limit,

which we like to refer to as our speed limit.

Just like if you're driving down the highway too fast,

you're going to get pulled over by a cop

and you're going to get a ticket.

That's what policing is going to do for us.

Now, we're just going to go and drop you off the network

anytime you're going too fast.

So, dropped packets are going to result in retransmissions,

which then consume even more bandwidth.

Therefore, policing is only good

for very high-speed interfaces.

If you're using a dial up modem or an ISDN connection,

or even a T1, you probably don't want to use policing.

You're much better off using our second method,

which is known as shaping.

Now, what shaping is going to do for us

is it's going to allow the buffer

to delay traffic from exceeding the configured rate.

Instead of dropping those packets like we did in policing,

we're just going to hold them in our buffer.

Then when it's empty and there's space available,

we're going to start pushing it

over that empty space and start shaping out the packets.

This is why we call it shaping or packet shaping.

Now you can see what this looks like here on the screen.

I have traffic at the top,

and you'll see all those jagged lines going down.

Now, what really happens here in your network

is there's this high period of time,

and there's low periods of time,

because not everything is happening

all the time in an equal amount.

If we do policing, all we did was chop off the tops,

which gave us more retransmissions.

With shaping, instead, we're going to start filling

in from the bottom, from our queue.

So it keeps up there right towards the speed limit

without going over it.

Again, shaping does a better job

of maximizing your bandwidth,

especially on slow speed interfaces,

like a T1 connection, a dial up,

satellite connections, or ISDN.
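
Here's a simplified Python sketch contrasting the two behaviors: policing throws away anything over the configured rate in a given interval, while shaping buffers the excess and sends it later. The rate and the traffic numbers are purely illustrative.

```python
from collections import deque

# Minimal sketch contrasting policing and shaping against a configured rate.
# Illustrative numbers: 5 packets per tick is the "speed limit."
RATE_PER_TICK = 5

def police(arrivals_per_tick):
    """Policing: anything over the rate within a tick is simply dropped."""
    return sum(max(0, n - RATE_PER_TICK) for n in arrivals_per_tick)

def shape(arrivals_per_tick):
    """Shaping: excess packets wait in a buffer and are sent in later ticks."""
    buffer = deque()
    sent_per_tick = []
    for n in arrivals_per_tick:
        buffer.extend(range(n))                  # queue the new arrivals
        sent = min(RATE_PER_TICK, len(buffer))   # send up to the rate limit
        for _ in range(sent):
            buffer.popleft()
        sent_per_tick.append(sent)
    return sent_per_tick, len(buffer)

bursty_traffic = [9, 1, 8, 0, 2]                 # peaks and valleys
print("policing drops:", police(bursty_traffic))  # -> 7 packets lost
print("shaping sends:", shape(bursty_traffic))    # -> ([5, 5, 5, 3, 2], 0) smoothed, nothing lost
```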

Then the last thing we need to talk about here

is link efficiency.

Now there's a couple of things we need to mention

in regard to link efficiency.

The first of which is compression.

To get the most out of your link,

you want to make it the most efficient possible.

And so to do that, we can compress our packets.

If we take our payloads and we compress it down,

that's going to conserve bandwidth

because it's less ones and zeros

that need to go across the wire.

VoIP is a great thing that you can compress

because there's so much extra space

that's wasted inside of voice traffic.

VoIP payloads can actually be reduced

by up to 50% of their original space.

We could take it down from 40 bytes

down to 20 bytes by using compression.

If you think that's good, look at the VoIP header.

I can compress the VoIP header down

by 90 or 95% of its original size.

I can take it from 40 bytes down to just two to four bytes.

To do this, we use something called compressed RTP or cRTP.

Now, when I have the original VoIP payload,

as you can see here, I have an IP address,

I have UDP as my packet type,

and I have RTP for its header.

And then I have my voice payload.

I can compress all of that down into just a cRTP,

which consolidates the IP, the UDP,

and the RTP altogether into one.

The voice payload can also be squeezed down

to about half of its size.

Now you're not going to notice a big difference

in your audio quality by doing this either,

and this can be utilized on slower speed links

to make the most of your limited bandwidth.
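
To put rough numbers on that, here's a quick back-of-the-envelope calculation in Python. It assumes a 20-byte voice payload, a 2-byte compressed header, and 50 packets per second per call, which are illustration values rather than figures from the video.

```python
# Quick arithmetic on VoIP header compression (cRTP), assuming a 20-byte
# voice payload per packet and 50 packets per second per call.
IP, UDP, RTP = 20, 8, 12      # standard header sizes in bytes
CRTP = 2                      # compressed IP/UDP/RTP header (2 to 4 bytes)
payload = 20
pps = 50

uncompressed = (IP + UDP + RTP + payload) * pps * 8   # bits per second
compressed   = (CRTP + payload) * pps * 8

print(f"Uncompressed: {uncompressed / 1000:.1f} Kbps per call")  # -> 24.0 Kbps
print(f"With cRTP:    {compressed / 1000:.1f} Kbps per call")    # -> 8.8 Kbps
```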

And it's not just for VoIP.

You can do this with other types of data too.

Compression is a great thing to use.

There are devices out there called WAN accelerators

that focus specifically on compressing your data

before sending it out over your WAN link.

The last thing I want to talk about here

is what we call LFI, which is another method

to make more efficient use of your links.

This is known as link fragmentation and interleaving.

Now what this does is if you have a really big packet,

it'll start chopping those up

and take those big packets and fragment them,

and then interleave smaller packets in between them.

This way, it's going to allow you to utilize

those slower speed links to make the most

of your limited bandwidth.

Notice here I have three voice packets,

and one big chunk of data.

Now what the router would do is chop up that data

and send one small voice piece,

and then one small data piece,

and then one small voice piece,

and one small data piece.

That way, the voice doesn't suffer

from huge latency by waiting for that big piece

of data to go through first.

By doing this fragmentation and interleaving,

it allows you to get some of that high priority traffic out

in between those larger data structures as well.
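
Here's a minimal Python sketch of that interleaving idea: a large data packet is broken into fragments, and the small voice packets get slotted in between them. The packet and fragment sizes are made up for illustration.

```python
# Minimal sketch of link fragmentation and interleaving (LFI): a large data
# packet is chopped into fragments, and small voice packets are slotted in
# between them so voice never waits behind the whole data transfer.
FRAGMENT_SIZE = 200   # illustrative fragment size in bytes

def lfi(voice_packets, data_packet_size):
    fragments = [FRAGMENT_SIZE] * (data_packet_size // FRAGMENT_SIZE)
    if data_packet_size % FRAGMENT_SIZE:
        fragments.append(data_packet_size % FRAGMENT_SIZE)

    schedule = []
    for i, frag in enumerate(fragments):
        if i < len(voice_packets):
            schedule.append(("voice", voice_packets[i]))
        schedule.append(("data-fragment", frag))
    # any remaining voice packets go out after the last data fragment
    schedule.extend(("voice", v) for v in voice_packets[len(fragments):])
    return schedule

print(lfi(voice_packets=[60, 60, 60], data_packet_size=1500))
```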

High availability.

In this lesson,

we're going to talk all about high availability.

Now, when we're talking about high availability,

we're really talking about making sure our systems are up

and available.

Availability is going to be measured in what we call uptime

or how many minutes or hours you're up and available

as shown as a percentage.

Usually, you're going to take the amount of minutes

you were up,

divided by the total amount of minutes in a period,

and that gives you a percentage known as uptime.

Now, we try to maintain what is known as the five nines

of availability in most commercial networks.

This is actually really hard because that's 99.999%.

That means I get a maximum of about five minutes

of downtime per year,

which is not a whole lot of downtime.

In some cloud based networks,

they aim for six nines of availability, or 99.9999%.

This equates to just 31 seconds of downtime

each and every year.
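
If you want to check those downtime numbers yourself, here's the quick arithmetic in Python.

```python
# Quick arithmetic: how much downtime a given availability percentage allows per year.
MINUTES_PER_YEAR = 365 * 24 * 60

for nines, availability in [("five nines", 99.999), ("six nines", 99.9999)]:
    allowed_downtime_min = MINUTES_PER_YEAR * (1 - availability / 100)
    print(f"{nines} ({availability}%): "
          f"{allowed_downtime_min:.2f} minutes "
          f"(~{allowed_downtime_min * 60:.0f} seconds) per year")
# -> five nines allows about 5.26 minutes; six nines about 32 seconds
```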

Now, as you can imagine,

I need more than 31 seconds of downtime

or even five minutes of downtime

to fully patch my servers and install a new hard drive

or put in a new router or switch when one fails.

So, how do I maintain that high level of availability?

Well, I'm going to do that,

by designing my networks to be highly available.

Now, there are two terms you need to understand

and be familiar with,

when we talk about high availability.

There is availability and reliability,

and these are different things.

When I'm talking about availability,

this is concerned with being up and operational.

When I talk about reliability,

I'm concerned with not dropping packets

inside of my network.

If your network is highly available,

but it's not reliable,

it's not a very good network

because it's dropping things all the time

and isn't doing what it's supposed to.

But conversely,

you can have a really highly reliable network,

but if it's not a highly available one,

nobody can use it either because it's down all the time.

So that wouldn't be good either.

So, let's say I had the most highly reliable network

in the entire world,

but it's only up 20 minutes a year.

That's not going to be any good, right?

So, we want to make sure we balance these two things.

We have to aim for good enough in both areas

to meet our business needs based on the available resources

and the amount of money we have to build our networks.

So, when we measure our different network components,

we have to determine how highly available they are.

And we do that through measurement of MTTR and MTBF.

Now, MTTR is the mean time to repair.

This measure is the average time it takes to repair

a network device when it breaks.

After all,

everything is going to break eventually.

So, when a device breaks,

how long does it take for you to fix it?

And how much downtime are you going to experience?

That is what we're trying to measure

when we deal with the mean time to repair.

Now, the mean time between failures or MTBF,

is going to measure the average time

between when a failure happens on a device

and the next failure happens.

Now, these two terms can often be confusing.

So, let me display it on a timeline and explain a little bit

about what they look like in the real world.

Now, let's say I had a system failure

at this first stop sign here on the left side.

Then we resume normal operations because we fix things.

That amount of time, was the time to repair.

Now, if I average all the times to repair

over the entire year for that type of device,

that's going to give me my MTTR,

my mean time to repair, the average time to repair.

Now, on the failure side of things,

we want to measure the time from one failure,

through fixing it and using it again,

until the next failure happens.

This becomes the time between the failures.

If I average all those together,

I get the average time between failures

or the mean time between failures, MTBF.

Hopefully, you can see the difference here.

Remember, when we're dealing with mean time to repair,

we want this to be a very small number.

When we deal with the mean time between failures,

we want this to be a very large number.

This means that,

with a very small number for mean time to repair,

we can fix things really quickly

and get ourselves back online.

So, the lower the mean time to repair is,

the better the network availability.

Now, on the other hand,

when we start talking about mean time between failures,

we want a really long time

because this means that the device has stayed up

and operational for a very long time before it fails.

This is going to give us better network availability,

and overall, it should give us better reliability too.

Now, we don't want a lot of failures here.

And so the more time in between failures,

the better that is for our network.
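
One common way these two numbers tie back to availability is the relationship availability = MTBF / (MTBF + MTTR). Here's a quick Python sketch using made-up hours to show how it works out.

```python
# Quick arithmetic: availability from MTBF and MTTR, using the standard
# relationship availability = MTBF / (MTBF + MTTR). The hours are made up.
mtbf_hours = 8760.0   # device averages one failure per year
mttr_hours = 4.0      # and takes four hours to repair on average

availability = mtbf_hours / (mtbf_hours + mttr_hours)
print(f"Availability: {availability:.5%}")   # -> about 99.954%
```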

So, how do we design these networks

to be highly reliable and highly available?

Well, we're going to add redundancy to our networks

and their devices.

Now, redundancy can be achieved through a single device

or by using multiple devices.

If you're using a single device,

you're still going to have single points of failure

in your network,

but it is cheaper than being fully hardware redundant.

Let's take a look at this concept for a moment.

Here you could see a single point of failure in my network.

Even though I have two switches and multiple connections

between those switches,

which gives me additional redundancy,

that router is not giving me additional redundancy.

It's a single point of failure

because it's the only router I have.

So, even if the router has internal hardware redundancy,

like two power supplies and two network cards,

I still only have one router chassis and one circuit board

running in that router.

So, if that router goes down,

this entire network is going to stop.

Therefore, this is considered a single point of failure.

Now instead,

I could redesign the network

and I can increase its redundancy by doing this.

Notice, I now have two PCs that want to talk to each other.

And each of them has dual network interface cards

talking to two different switches.

And each of those switches talks to two different routers.

Everything is connected to everything else

in a mesh topology for these network devices.

This gives me multiple connections between each device

and provides me with link redundancy, component redundancy,

and even inside those devices,

I may have two network cards, two power supplies,

and two of every other internal network component there is,

so that I have a very redundant

and highly available network.

Now, if one of those routers needs to be upgraded,

I can take it offline and update its firmware,

and then the entire time that second router

is still on the network,

maintaining the load and providing service to all the users.

Then I can put the first router back on the network,

take off the second router and then do its upgrades.

By doing this and taking turns,

I still am able to have network functions run,

and I have no downtime associated with this network.

This is how you keep a network highly available.

Now, let's talk a little bit more about hardware redundancy.

Inside these routers and other network devices,

we can have hardware redundancy or the devices themselves

could be hardware redundant.

Now, if I have two routers and they're both

serving the same function,

this is considered hardware redundancy.

I could also have hardware redundancy in the components

by having two network cards or two hard drives

or two internal power supplies on a single device.

That way, if one of them fails, the second one takes over.

Usually, you're going to find this

in strategic network devices,

things like your switches, your routers, your firewalls,

and your servers,

because you can't afford a failure

in any one of those devices,

because they would take down large portions

of your network or its services.

On the other hand, if I'm considering my laptop,

I only have one hard drive in it.

If that laptop fails or that hard drive fails,

I would just deal with that downtime.

I might buy a new laptop or a new hard drive

and then restore from an old backup.

That would get me back up and running.

Now, when we're working with end-user devices

like workstations and clients,

we often don't deal with redundancy.

But when you start getting to servers and routers

and switches and firewalls,

you need to start having hardware

and component level redundancy

because these serve lots of end-users.

When we deal with this redundancy,

we can then cluster our devices and have them work in either

an active-active

or active-passive configuration.

All right.

Let's assume I have this one computer

and it has two network interface cards

that are connected to the network.

Do I want to talk to both routers at the same time?

Well, if I'm active-active,

then both of those network interface cards

are going to be active at the same time,

and they each are going to have their own MAC address,

and they're going to be talking at the same time

to either of these two routers.

This can then be done to increase the bandwidth

of this computer and load balance

across both network interface cards.

This is known as Network Interface Card teaming,

or NIC teaming,

where a group of network interface cards,

is used for load balancing and failover for a server

or another device like that.

Now, on the other hand,

we can use active-passive,

and this is going to have a primary

and a backup network interface card.

Now, one of these cards is going to be active

and being used at all times.

And when it fails,

the other card is going to go from standby and take over.

In this case,

there is no performance increase by having two cards,

but I have true redundancy and failover capabilities.

In an active-passive configuration,

both NICs are going to be working together

and they're going to have a single MAC address

that they're going to display to the network,

so they look like they're a single device.
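
Here's a minimal Python sketch of the difference between the two modes; the NIC names and up/down states are made up for illustration.

```python
# Minimal sketch contrasting active-active and active-passive NIC teaming.
nics = [
    {"name": "nic0", "up": True},
    {"name": "nic1", "up": True},
]

def active_active(nics):
    """Both healthy NICs carry traffic at once (load balancing plus failover)."""
    return [n["name"] for n in nics if n["up"]]

def active_passive(nics):
    """Only the first healthy NIC carries traffic; the others stand by."""
    for n in nics:
        if n["up"]:
            return [n["name"]]
    return []

print("active-active: ", active_active(nics))    # -> ['nic0', 'nic1']
print("active-passive:", active_passive(nics))   # -> ['nic0']

nics[0]["up"] = False                             # the primary NIC fails
print("after failover:", active_passive(nics))    # -> ['nic1']
```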

Now, when you start looking at redundancy at layer three,

we're going to start talking about our routers.

Now here,

our clients are getting configured with a default gateway,

which is our router by default.

But, if the default gateway went down,

we wouldn't be able to leave the subnet,

and so we'd be stuck on the internal network.

Now, we don't want that.

So instead,

we want to add some redundancy

and we'll use layer three redundancy

using a virtual gateway.

To create a virtual gateway,

we need to use a First Hop Redundancy Protocol, or FHRP,

such as the Virtual Router Redundancy Protocol, VRRP.

Now, the most commonly used First Hop Redundancy Protocol

is known as HSRP or the Hot Standby Router Protocol.

This is a layer three redundancy protocol

that's used as a proprietary First Hop Redundancy Protocol

in Cisco devices.

HSRP is going to allow for an active and a standby router

to be used together.

And instead,

we get a virtual router that's defined

as our default gateway.

The client devices like the workstations and servers

are then going to be configured to use the virtual router

as its gateway.

When the PC communicates to the IP of the virtual router,

the router will determine which physical router is active

and which one is standby.

And then, it forwards the traffic to that active router.

If the active router goes down,

the standby router will pick up the responsibility

for that active router

until the other router comes back online

and takes over its job again.

Now, with VRRP,

the Virtual Router Redundancy Protocol,

this is one that was created

by the Internet Engineering Task Force.

It's an open standard variant

of the Hot Standby Router Protocol or HSRP.

VRRP allows for one master or active router,

and the rest can then be added in a cluster as backups.

Unlike HSRP,

where you can only have one router as active,

and one as standby,

VRRP is going to allow you to have multiple standby routers.

Just like HSRP,

you're going to configure the VRRP

to create a virtual router

that's going to be used as a default gateway

for all of your client devices.
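
Here's a minimal Python sketch of the failover idea behind these first hop redundancy protocols; the addresses and priorities are made up, and real HSRP or VRRP behavior has more moving parts (hello timers, preemption, and so on) than this shows.

```python
# Minimal sketch of the first hop redundancy idea: clients always point at the
# virtual gateway IP, and traffic is handled by whichever real router is
# currently healthy, preferring the highest-priority one.
VIRTUAL_GATEWAY_IP = "192.168.1.1"   # made-up addressing for illustration

routers = [
    {"name": "R1", "ip": "192.168.1.2", "priority": 110, "alive": True},
    {"name": "R2", "ip": "192.168.1.3", "priority": 100, "alive": True},
]

def active_router(routers):
    alive = [r for r in routers if r["alive"]]
    return max(alive, key=lambda r: r["priority"]) if alive else None

print("Clients point at:", VIRTUAL_GATEWAY_IP)
print("Active router:   ", active_router(routers)["name"])   # -> R1

routers[0]["alive"] = False                                   # R1 goes down
print("After failover:  ", active_router(routers)["name"])    # -> R2
```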

Now, in order to provide load balancing on your networks

and to increase both redundancy

and performance of your networks,

you can use GLBP,

which is the Gateway Load Balancing Protocol,

or you can use LACP, the Link Aggregation Control Protocol.

Now, GLBP or the Gateway Load Balancing Protocol

is a Cisco protocol,

and it's another proprietary First Hop Redundancy Protocol.

Now, GLBP will allow us to create a virtual router

and that virtual router will have two routers

being placed behind it,

in active and standby configuration.

The virtual router or gateway will then forward traffic

to the active or standby router

based on which one has the lower current loading,

when the gateway receives that traffic.

If both can support the loading,

then the GLBP will send it to the active

since it's considered the primary device.

By using GLBP,

you can increase the speeds of your network

by using load balancing between two routers or gateways.

Now, the second thing we can use,

is LACP or Link Aggregation Control Protocol.

This is a redundancy protocol that's used at layer two.

So we're going to be using this with switches.

LACP is going to achieve redundancy by having multiple links

between the network devices,

where load balancing over multiple links can occur.

The devices are all going to be considered

part of a single combined link

even when we have multiple links.

This gives us higher speeds and increases our bandwidth.

For example,

let's pretend I have four Cat5 cables.

Each of these are connected to the same switch.

Now, each of those cables has 100 megabits per second

of bandwidth.

Now, if I use the Link Aggregation Control Protocol,

I can bind these altogether and aggregate them

to give me 400 megabits per second

of continuous bandwidth by creating

this one single combined bandwidth,

from those four connections.

Now, let's consider what happens

when that is trying to leave our default gateway

and get out to the internet.

Now in your home,

you probably only have one internet connection,

but for a company,

you may wish to have redundant paths.

For example, at my office,

we have three different internet connections.

The first is a microwave link that operates

at 215 megabits per second for uploads and downloads.

The second, is a cable modem connection.

It operates at 300 megabits per second for downloads

and 30 megabits per second for uploads.

Now the third is a cellular modem,

and that gives me about 100 megabits per second

for downloads and about 30 megabits per second for uploads.

Now, the reason I have multiple connections,

is to provide us with increased speed and redundancy.

So, to achieve this,

I take all three of these connections

and connect them to a single gateway

that's going to act as a load balancer.

If all the connections are up and running,

they're going to load balance my traffic

across all three of those connections

to give me the highest speeds at any given time.

But, if one of those connections drops,

the load balancer will remove it from the pool

until it can be returned to service.

By doing this,

I can get a maximum speed of about 615 megabits per second

for a combined download.

And on the upload,

I can get about 275 megabits per second,

when using all three connections

and they're all functioning and online.
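
Here's the quick arithmetic on that example in Python, using the three link speeds listed above; with all links up and load balanced, the combined throughput is roughly the sum of the individual links.

```python
# Quick arithmetic on the multi-WAN example above (speeds in Mbps).
links_mbps = {
    "microwave": {"down": 215, "up": 215},
    "cable":     {"down": 300, "up": 30},
    "cellular":  {"down": 100, "up": 30},
}

total_down = sum(link["down"] for link in links_mbps.values())
total_up = sum(link["up"] for link in links_mbps.values())

print(f"Combined download: ~{total_down} Mbps")   # -> ~615 Mbps
print(f"Combined upload:   ~{total_up} Mbps")     # -> ~275 Mbps
```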

Similarly,

you might be in an area where you can get fiber connections

to your building.

Now, in those cases,

you may purchase a primary and a backup connection.

And if you do,

you should buy them from two different providers.

If both of your connections are coming from the same company

and they go down,

well, guess what?

You just lost both of your connections

because the upstream ISP might be down.

For this reason,

it's always important to have diversity in your path

when you're creating link redundancy,

just like I did in my office.

I have a microwave connection through one ISP.

I have a cable modem through another ISP,

and I have a cellular modem through a third ISP.

That way,

if any one of them goes down,

I still have two other paths I can use.

Now, the final type of redundancy that we need to discuss,

is known as multipathing.

Multipathing is used in our storage area networks.

Multipathing is used to create more than one physical path

between the server and its storage devices.

And this allows for better fault tolerance

and performance enhancements.

Basically, think of multipathing

as a form of link aggregation,

but instead of using it for switches,

we're going to use it for our storage area networks.

In the last lesson I showed you a couple of diagrams

of redundant networks,

but one of the things we had to think about in this lesson

are the considerations we have

when we start designing these redundant networks.

First you need to ask yourself,

are you going to use redundancy in the network,

and if so, where, and how?

So are you going to do it from a module or a parts perspective?

For instance, are you going to have multiple power supplies,

multiple network interface devices, multiple hard drives,

or are you going to look at it more from a chassis redundancy

and have two sets of routers or two sets of switches?

These are things you have to think about.

Which one of these are you going to use,

because each one is going to affect the cost

of your network, based on the decisions you make.

You have to be able to make a good business case

for which one you're going to use, and why.

For instance, if you could just have

a second network interface card or a second power supply,

that's going to be a lot cheaper

than having to have an entire switch

or an entire extra router there.

Now, each of those switches and routers,

some of these can cost 3 or 4 or $5,000,

and so it might be a lot cheaper

to have a redundant power supply, right,

and so these are the things you have to think about

and weigh as you're building your networks.

Another thing you have to think about

is software redundancy,

and which features of those are going to be appropriate.

Sometimes you can solve a lot of these redundancy problems

by using software as opposed to hardware.

For example, if you have a virtual network setup,

you could just put in a virtual switch

or a virtual router in there,

and that way you don't have to bring

another real router or real switch in,

that can save you a lot of money.

There's also a lot of other software solutions out there,

like a software RAID,

that will give you additional redundancy

for your storage devices,

as opposed to putting in an extra hard drive chassis,

or another RAID array or storage area network.

Also, these are the types of things

you have to be thinking about

as you're building out your network, right?

When you think about your protocols,

what protocol characteristics

are going to affect your design requirements?

This is really important if you're designing things,

and you're using something like

TCP versus UDP in your designs,

because TCP has that additional redundancy

by resending packets, where UDP doesn't,

this is something you have to consider as well.

As you design all these different things,

all of these different factors are going to work together,

just like gears, and each one turns another,

and each one is going to feed another one,

and you get more reliability and availability

in your networks

by adding all these components together.

In addition to all this,

there are other design considerations

that we have to think about as well,

like what redundancy features should we use

in terms of powering the infrastructure devices?

Are we going to have internal power supplies

and have two of those, and have them redundant?

Or, are we going to have battery backups, or UPSs,

or are we going to have generators?

All of these things are things you have to think about,

and I don't have necessarily the right answers for you,

because it all comes down to a case-by-case basis.

Every network is going to be different,

and every one has its own needs

and its own business case associated with it.

The networks that I had at former employers

were serving hundreds of thousands of clients,

and those were vastly different than the ones

that are servicing my training company right now,

with just a handful of employees.

Because when you're dealing with your network design

and your redundancies,

you have to think about the business case first.

Each one is going to be different

based on your needs and your considerations.

What redundancy features should be used

to maintain the environmental conditions of your space?

To keep good power and space and cooling,

you need to make sure

that you're thinking about air conditioning,

and do you have one unit or two?

Do you have generators onsite?

Do you have additional thermal heating or thermal cooling?

All of these things are things you have to think about.

What do you do when power goes down?

What are some of those things

that you're going to have to deal with

if you're running a server farm

that has to have units running all the time,

because it can't afford to go down

because it's going to affect

thousands and thousands of people,

instead of just your one office with 20 people?

All of these are things you have to consider

as you think about it.

In my office, we made the decision

that one air conditioning unit was enough,

because if it goes down, we might just not work today

and we'll come to work tomorrow, we can get over that.

But in a server farm,

we need to make sure we have multiple air conditioners,

because if that goes down

it can actually burn up all the components, right?

So we have to have additional power and space and cooling

that are fully redundant,

because of that server infrastructure

that we're supporting there.

These are the things you have to balance in your practices.

And so when you start looking at the best practices,

I want you to examine your technical goals

and your operational goals.

Now what I mean by that is,

what is the function of this network?

What are you actually trying to accomplish?

Are you trying to get to 90% uptime, or 95%, or 99%,

or are you going for that gold standard

of five nines of availability?

Every company has a different technical goal,

and that technical goal is going to determine

the design of your network.

And you need to identify that

inside of your budgeting as well,

because funding these high-availability features

is really expensive.

As I said, if I want to put a second router in there,

that might cost me another 3,000 or $5,000.

In my own personal network,

we have a file server, and it's a small NAS device.

We decided we weren't comfortable

having all of our file storage on a single hard drive,

so we built this NAS array instead,

so if one of those drives goes out,

we have three others that are carrying the load.

This is the idea here.

Now, eventually we decided we didn't need that NAS anymore,

and so we replaced that NAS enclosure with a full RAID 5.

Later on we took that full RAID 5

and we switched it over to a cloud server

that has redundant backups

in two different cloud environments.

And so all of these things work together

based on our decisions,

but as we moved up that scale

and got more and more redundancy,

we have more and more costs associated.

It was a lot cheaper just to have an 8-terabyte hard drive

with all of our files on it,

then we went to a NAS array

and that cost two or three times that money,

then we went to a full RAID 5

and that cost a couple more times that,

then we went to the cloud and we have to pay more for that.

Remember, all your decisions here

are going to cost you more money,

but if it's worth it to you, that would be important, right,

and so these are the things you have to balance

as you're designing these fully redundant networks,

based on those technical goals.

You also need to categorize

all of your business applications into profiles,

to help with this redundancy mission

that you're trying to go and accomplish here.

This will really help you as you start

going into the quality of service as well.

Now if I said, for instance,

that web is considered category one

and email is category two

and streaming video's going to be category three,

then we can apply profiles

and give certain levels of service

to each of those categories.

Now we'll talk specifically of how that works

when we talk about quality of service in a future lesson.

Another thing we want to do

is establish performance standards

for our high-availability networks.

What are the standards that we're going to have to have?

These standards are going to drive

how success is measured for us,

and in the case of my file server, for instance,

we measure success as it being up and available

when my video editors need to access it,

and that they don't lose data,

because if we lost all of our files,

that'd be bad for us, right?

Those are two metrics that we have,

and we have numbers associated with each of those things.

In other organizations, we measure it based on the uptime

of the entire end-to-end service,

so for an ISP, if a client can't get out to the internet,

that would be a bad thing; that's one of their measurements.

Now the other one might be, what is their uptime?

All of these performance standards are developed

through metrics and key performance indicators.

If you're using something like ITIL

as your IT service management standards,

this is what you're going to be doing as you're trying

to run those inside your organization as well.

Finally, here we want to define how we manage and measure

the high-availability solutions for ourselves.

Metrics are going to be really useful to quantify success,

if you develop those metrics correctly.

Decision-makers and leaders love seeing metrics.

They love seeing charts and seeing the performance,

and how it's going up over time,

and how our availability is going up,

and how our costs are going down.

Those are all good things,

but if you don't know what you're measuring

or why you're measuring it,

which really goes back to your performance standards,

then these are the kinds of things

that are wasting your time with metrics.

A lot of people measure a lot of things,

and they don't really tell you

if you're getting the outcome you're wanting.

I want to make sure that you think about

how you decide on what metrics you're going to use.

Now, we've covered a lot of different design criteria

in this lesson, but the real big takeaway here

that I want you to think about is this.

If you have an existing network,

you can add availability to it,

and you can add redundancy to it.

You can retrofit stuff in,

but it's going to cost you a lot more time

and a lot more money.

It is much, much cheaper

to design this stuff early in the process

when you start building a network from scratch.

So, if you're designing a network and you're asked early on

what kind of things you need,

I want you to think about all these things of redundancy

in your initial design.

Adding them in early is going to save you a lot of money.

Every project has three main factors,

time, cost, and quality,

and usually, one of these things is going to suffer

at the expense of the other two.

For example, if I asked you to build me a network

and I want it to be fully redundant

and available by tomorrow, could you do it?

Well, maybe, but it's probably going to cost me a lot of money,

and because I gave you very little time,

it's going to cost me even more,

or your quality is going to suffer.

So, you could do it good, you could do it quick,

or you could do it cheap, but you can't do all three.

It's always going to be a trade-off between these three things,

and I want you to remember

as you're out there and you're designing networks,

you need to make sure you're thinking about your redundancy

and your availability and your reliability,

because often that quality is going to suffer

in favor of getting things out quicker

or getting things out cheaper.

Recovery sites.

In this lesson, we're going to discuss the concept

of recovery sites.

After all, things are going to break and your networks

are going to go down.

This is just a fact of life.

So what are you going to do when it comes time

to recover your enterprise network?

Well, that's what we're going to discuss in this lesson.

When it comes to designing redundant operations

for your company,

you really should consider a recovery site.

And with recovery sites, you have four options.

You see, you can have all the software and hardware

redundancy you want.

But at the end of the day,

sometimes you need to actually recover your site too.

Now this could be because there's a fire that breaks out

in your building or a hurricane or earthquake.

All of these things might require you to relocate

and if you do, you're going to have to choose

one of four options.

This could be a cold site, a warm site, a hot site

or a cloud site.

Now when we deal with cold sites,

this means that you have a building that's available

for you to use,

but you don't have any hardware or software in place.

And if you do, those things aren't even configured.

So you may have to go out to the store and buy routers

and switches and laptops and servers

and all that kind of stuff.

You're going to bring it to a new building, configure it

and then restore your network.

This means that while recovery is possible,

it's going to be slow and it's going to be time consuming.

If I have to build you out a new network at a cold site,

that means I'm going to need you to bring everything in

after the bad thing has already happened,

such as your building catching fire.

And this can take me weeks or even months

to get you fully back up and running.

Now, the biggest benefit of using a cold site

is that it is the cheapest option

that we're going to talk about.

The drawbacks are that it is slow and essentially

this is just going to be an empty building

that's waiting for you to move in and start rebuilding.

Now next, we have a warm site.

A warm site means you have the building available

and it already contains a lot of the equipment.

You might not have all your software installed

on these servers or maybe you don't have the latest security

patches or even the data backups from your other site

haven't been recovered here yet.

But you do already have the hardware

and the cabling in place.

With a warm site,

we already have a network that's running the facility.

We have switches and routers and firewalls.

But we may not maintain it fully

each and every day of the year.

So, when a bad event happens

and you need to move into the warm site,

we can load up our configurations on our routers

and switches, install the operating systems on the servers,

restore the files from backup

and usually within a couple of days,

we can get you back up and running.

Normally with a warm site,

we're looking at a restore time of between 24 hours

and seven days.

Basically, under a week.

Recovery here is going to be fairly quick,

but not everything from the original site

is going to be there and ready for all employees

at all times.

Now, if speed of recovery is really important to you,

the next type of site is your best choice.

It's known as a hot site.

Now hot site is my personal favorite.

But it's also the most expensive to operate.

With a hot site, you have a building, you have the equipment

and you have the data already on site.

That means everything in the hot site is up and running

all the time.

Ready for you to instantly switch over your operations

from your primary site to your hot site

at the flip of a switch.

This means you need to have the system and network

administrators working at that hot site every day

of the year, keeping it up and running, secured

and patched and ready for us to take over operations

whenever we're told to.

Basically, your people are going to walk out of the old site,

get in their car, drive to the new site, login

and they're back to work as if nothing ever happened.

This is great because there's very minimal downtime.

And you're going to have nearly identical levels of servers

at the main site and the hot site.

But as you can imagine, this costs a lot of money.

Because I have to pay for the building,

two sets of equipment, two sets of software licenses

and all the people to run all this stuff.

You're basically running two sites at all times.

Therefore, a hot site gets really expensive.

Now a hot site is very critical

if you're in a high availability type of situation.

Let's say you work for a credit card processing company.

And every minute they're down costs them millions of dollars.

They would want to have a hot site, right.

They don't want to be down for three or four weeks.

So they have to make sure they have their network up

and available at all times.

Same thing if you're working for the government

or the military,

they always need to make sure they're operating

because otherwise people could die.

And so they want to make sure that is always up and running.

That's where hot sites are used.

Now if you can get away from those types of criticality

requirements, which most organizations can,

you're going to end up settling on something like a warm site,

because it's going to save you on the cost of running

that full recovery hot site.

Now the fourth type of site we have

is known as a cloud site.

Now a cloud site isn't exactly a full recovery site,

like a cold, warm, or hot site is.

In fact, there may be no building for you to move

your operations into.

Instead, a cloud site is a virtual recovery site

that allows you to create a recovery version

of your organization's network in the cloud.

Then if disaster strikes, you can shift all your employees

to telework operations by accessing that cloud site.

Or you can combine that cloud site with a cold or warm site.

This allows you to have a single set of system

administrators and network administrators

that run your day to day operational networks

and they can also run your backup cloud site.

Because they can operate it all

from wherever they're sitting in the world.

Now cloud sites are a good option to use,

but you are going to be paying a cloud service provider

for all the compute time, the storage

and the network access required to use that cloud site

before, during and after the disastrous event.

So, which of these four options should you consider?

Well, that really depends on your organization.

its recovery time objective, the RTO,

and its recovery point objective, the RPO.

Now the recovery time objective or RTO

is the duration of time and service level

within which a business process has to be restored

after disaster happens in order to avoid unacceptable

consequences associated with a break in continuity.

In other words, our RTO is going to answer our question,

how much time did it take for the recovery to happen

after the notification of a business process disruption?

So, if you have a very low RTO,

then you're going to have to use either a hot site

or a cloud site because you need to get up and running

quickly.

That is the idea of a low RTO.

Now on the other hand, we have to think about our RPO.

Which is our recovery point objective.

Now RPO is going to be the interval of time that might pass

during the disruption before the quantity of data loss

during that period exceeds the business continuity plan's

maximum allowable threshold or tolerance.

Now RPO is going to determine the amount of data

that will be lost or will have to be re-entered

during network operations in downtime.

It symbolizes the amount of data that can be acceptably lost

by the organization.

For example, in my company we have an RPO of 24 hours.

That means if all of our servers crashed right now,

I as the CEO have accepted the fact that I can lose no more

than the last 24 hours worth of data and that would be okay.

To achieve this RPO,

I have daily backups that are conducted every 24 hours.

So, we can ensure we always have our data backed up

and ready for restoral at any time.

And that means we will lose at most 24 hours worth of data.

The RTO that recovery time objective is going to be focused

on the real time that passes during a disruption.

Like if you took out a stopwatch and started counting.

For example, can my business survive

if we're down for 24 hours?

Sure.

It would hurt, we would lose some money, but we can do it.

How about seven days?

Yeah, again, we would lose some money,

we'd have some really angry students,

but we could still survive.

Now, what about 30 days?

No way.

Within 30 days all of my customers and students,

they would have left me.

They would take their certifications

through some other provider out there

and I would be out of business.

So I had to figure out that my RTO is someplace between one

and seven days to make me happy.

So that's the idea of operational risk tolerance,

we start thinking about this from an organizational level.

How much downtime are you willing to accept?

Based on my ability to accept seven days,

I could use a warm site instead of a hot site.

But if I could only accept 24 hours of downtime

or five minutes of downtime,

then I would have to use a hot site instead.

RTO is used to designate that amount of real time

that passes on the clock before that disruption

begins to have serious and unacceptable impediments

to the flow of our normal business operations.

That is the whole concept here with RTO.
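
If it helps to see that decision as something concrete, here's a minimal sketch in Python, using made-up hour thresholds, of how an organization's RTO could map to one of the four recovery site options we just covered.

```python
# Hypothetical sketch: map a recovery time objective (RTO), in hours,
# to a recovery site type. The thresholds are illustrative, not a rule.

def suggest_recovery_site(rto_hours: float) -> str:
    """Suggest a recovery site type for a given RTO in hours."""
    if rto_hours <= 1:
        return "hot site or cloud site (near-instant switchover needed)"
    if rto_hours <= 24:
        return "hot site or cloud site (back up within a day)"
    if rto_hours <= 7 * 24:
        return "warm site (hardware in place, restore within a week)"
    return "cold site (cheapest, but weeks or months to rebuild)"

for rto in (0.5, 24, 72, 30 * 24):
    print(f"RTO of {rto} hours -> {suggest_recovery_site(rto)}")
```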

Now when we start talking about RPO and RTO,

you're going to see this talked about a lot in backups

and recovery as well.

When you deal with backups and recovery,

you have a few different types of backups.

We have things like full backups, incremental backups,

differential backups and snapshots.

Now a full backup is just what it sounds like.

It's a complete backup of every single file on a machine.

It is the safest and most comprehensive backup method,

but it's also the most time consuming and costly.

It's going to take up the most disk space

and the most time to run.

This is normally going to be run on your servers.

Now another type of backup we have

is known as an incremental backup.

With an incremental backup, I'm going to back up the data

that changed since the last backup.

So, if I did a full backup on Sunday

and I go to do an incremental backup on Monday,

I'm only going to back up the things that have changed

since doing that full backup on Sunday.

Now another type we have is known as a differential backup.

A differential backup is only going to back up the data

since the last full backup.

So, let's go back to my example

of Sunday being a full backup

and then I did an incremental backup on Monday.

Then that backup is going to copy everything since Sunday.

But if I do an incremental on Tuesday, it's only going to do

the difference between Monday and Tuesday.

because Monday was the last backup before that incremental.

When I do it Wednesday,

I'm going to get from Tuesday to Wednesday.

And so when I do these incrementals,

I now have a bunch of smaller pieces

that I have to put back together when I want to restore my servers.

Now a differential, on the other hand, is going to be

the entire difference since the last full backup.

So if on Wednesday I did a differential backup,

I'm going to have all the data that's different from Sunday,

the last full backup all the way up through Wednesday.

This is the difference between the differential

and an incremental.

So if I do a full backup on Sunday

and then on Monday I do both an incremental and a differential,

they're going to look exactly the same.

But on Tuesday the incremental is only going to include

the stuff since Monday.

But the differential will include everything since Sunday.

This includes all of Monday's and Tuesday's changes.

And so you can see how this differential is going to grow

throughout the week until I do another full backup

on the next Sunday.

Now if I do an incremental, it's only that last 24-hour period.
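
To make the incremental versus differential difference concrete, here's a small Python sketch that walks through a made-up week of file changes and prints what each method would copy each day.

```python
# Sketch: what incremental vs. differential backups would copy each day.
# The daily "changed files" sets below are invented for illustration.

changes = {
    "Monday":    {"report.docx"},
    "Tuesday":   {"budget.xlsx"},
    "Wednesday": {"report.docx", "notes.txt"},
}

changed_since_full = set()   # everything changed since Sunday's full backup

print("Sunday: FULL backup of every file")
for day, changed_today in changes.items():
    changed_since_full |= changed_today
    incremental = changed_today             # only changes since the previous backup
    differential = set(changed_since_full)  # all changes since the last full backup
    print(f"{day}: incremental copies {sorted(incremental)}, "
          f"differential copies {sorted(differential)}")
```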

Now the last type of backup we have is known as a snapshot.

Now if you're using virtualization

and you're using virtual machines,

this becomes a read only copy of your data frozen in time.

For example, I use snapshots a lot when I'm using virtual

machines or I'm doing malware analysis.

I can take a snapshot on my machine,

which is a frozen instant in time.

And then I can load the malware and all the bad things

I need to do.

And then once I'm done doing that,

I can restore back to that snapshot which was clean

before I installed all the malware.

This allows me to do dynamic analysis of it.

Now if you have a very large SAN,

or storage area network array,

you can take snapshots of your servers

and your virtual machines in a very quick and easy way

and then you'll be able to restore them exactly back

to the way they were at any given moment in time.

Now when we use full, incremental and differential,

most of the time those are going to be used with tape backups

and offsite storage.

But if you're going to be doing snapshots,

that's usually done to disk, like a storage area network array.

Now, in addition to conducting your backups of your servers,

it's also important to conduct backups

of your network devices.

This includes their state and their configurations.

The state of a network device contains all the configuration

and dynamic information from a network device

at any given time.

If you export the state of a network device,

it can later be restored to the exact same device

or another device of the same model.

Similarly, you can backup just the configuration information

by conducting a backup of the network device configuration.

This can be done using the command line interface

on the device or using third-party tools.

For example, one organization I worked for

had thousands of network devices.

So we didn't want to go around and do a weekly configuration

backup for all those devices individually.

Instead, we configured them to do that using a tool

known as SolarWinds.

Now once a week, the SolarWinds tool would back up

all the configurations and store them

on a centralized server.

This way, if we ever had a network device that failed,

we could quickly install a spare from our inventory,

restore the configurations from SolarWinds

back to that device, and we would be back online

in just a couple of minutes.
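
As a rough idea of what that kind of automated job looks like, here's a minimal Python sketch assuming the third-party netmiko library and Cisco IOS-style devices; the hostnames, credentials, and file paths are placeholders, and this is not how SolarWinds itself works under the hood.

```python
# Minimal sketch of a scheduled config backup job, assuming the third-party
# netmiko library and Cisco IOS-style devices. Hostnames, credentials, and
# the backup path are placeholders for illustration only.

from datetime import date
from pathlib import Path

from netmiko import ConnectHandler

DEVICES = ["10.0.0.1", "10.0.0.2"]            # hypothetical device IPs
BACKUP_DIR = Path("/srv/config-backups")      # hypothetical central store

def backup_device(host: str) -> None:
    conn = ConnectHandler(
        device_type="cisco_ios",
        host=host,
        username="backup-user",               # placeholder credentials
        password="CHANGE_ME",
    )
    running_config = conn.send_command("show running-config")
    conn.disconnect()

    out_file = BACKUP_DIR / f"{host}-{date.today()}.cfg"
    out_file.write_text(running_config)
    print(f"Saved config for {host} to {out_file}")

if __name__ == "__main__":
    BACKUP_DIR.mkdir(parents=True, exist_ok=True)
    for device in DEVICES:
        backup_device(device)
```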

Facilities support.

In this lesson, we're going to discuss the concept

of facilities and infrastructure support

for our data centers and our recovery sites.

To provide proper facility support,

it's important to consider power, cooling,

and fire suppression.

So we're going to cover uninterrupted power supplies,

power distribution units, generators, HVAC,

and fire suppression systems in this lesson.

First, we have a UPS, or uninterruptible power supply.

Now an uninterruptible power supply,

or uninterruptible power source,

is an electrical apparatus

that provides emergency power to a load

whenever the input power source or main power

is going to fail.

Most people think of these as battery backups,

but in our data centers and telecommunication closets,

we usually see devices

that contain more than just a battery backup.

For our purposes, we're going to use a UPS

that is going to provide line conditioning

and protect us from surges and spikes in power.

Our goal in using a UPS

is to make sure that we have clean reliable power.

Now a UPS is great for short-duration power outages,

but they usually don't last more than about 15 to 30 minutes

because they have a relatively short battery life.

The good news is the batteries

are getting better and better every day

and their lives are getting longer and longer

in newer units.

Second, we have power distribution units or PDUs.

Now a power distribution unit

is a device fitted with multiple outputs

designed to distribute electrical power,

especially to racks of computers

and networking equipment located within our data centers.

PDUs can be rack-mounted

or they can take the form of a large cabinet.

In a large data center,

you're usually going to see these large cabinets,

and in general,

there's going to be one PDU for each row of servers

and it maintains the high current circuits,

circuit breakers,

and power monitoring panels inside of them.

These PDUs can provide power protection from surges,

spikes, and brownouts,

but they are not designed

to provide full blackout protection like a UPS would

because they don't have battery backups.

Generally, a PDU will be combined with a UPS or a generator

to provide that power that is needed during a blackout.

Third, we have generators.

Now large generators are usually going to be installed

outside of a data center

in order to provide us with longterm power

during a power outage inside your region.

These generators can be powered by diesel,

gasoline, or propane.

For example, at my office,

I have a 20,000-watt diesel generator

that's used to provide power in case we have a power outage.

Now the big challenge with a generator though,

is that they take time to get up to speed

until they're ready to start providing power

to your devices.

They usually take between 45 and 90 seconds.

So you usually need to pair them up

with a battery backup or UPS

as you're designing your power redundancy solution.

For example, at my office, if the power goes out,

the UPS will carry the load for up to 15 minutes.

During that time,

the generator will automatically be brought online,

usually taking 45 to 90 seconds.

Once that generator is fully online,

and providing the right stable power,

and it's ready to take the load,

the power gets shifted

from the UPS batteries to the generator,

using an automatic transfer switch or ATS.

Now once the power is restored in our area

for at least five minutes being steady,

then our ATS will actually shift power back to the grid

through our UPS unit, that battery backup,

and then shut down our generator.
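
Here's a tiny Python sketch of that failover sequence, just to show the timing logic; the 90-second and 15-minute figures are the illustrative ones from this lesson, not vendor specifications.

```python
# Rough sketch of the failover sequence: UPS batteries carry the load while
# the generator spins up, then the automatic transfer switch (ATS) moves the
# load over. Timings are the illustrative ones from this lesson.

GENERATOR_START_S = 90      # generator takes roughly 45-90 seconds to stabilize
UPS_RUNTIME_S = 15 * 60     # batteries carry the load for up to ~15 minutes,
                            # so the generator must be ready well within that window

def power_source(seconds_since_outage: int) -> str:
    if seconds_since_outage < GENERATOR_START_S:
        return "UPS battery (generator still starting up)"
    return "generator (ATS has transferred the load)"

for t in (10, 60, 120, 20 * 60):
    print(f"{t:>4} seconds after the outage: {power_source(t)}")
```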

Fourth, we have HVAC units.

HVAC stands for heating, ventilation, and air conditioning.

Our data centers

are going to generate a ton of heat inside of them

because of all these servers, and switches,

and routers, and firewalls,

that are doing processing inside of them.

To cool down these devices,

we need to have a good HVAC system.

Now to help with this cooling,

most data centers are going to utilize

a hot and cold aisle concept.

Now in the simplest form,

each row of servers is going to face another row of servers.

These two rows

will have the front of the servers facing each other

and the rear of the servers facing away from the aisle.

This is because the servers are designed

to push air out the rear of the device.

So the front of the servers is in the cold aisle

and the rear of the servers is in the hot aisle.

This lets us focus our HVAC systems into the hot aisles

to suck that hot air out,

cool it down, and return it back to the cold aisle,

where it can then be circulated over the servers once again.

Remember, proper cooling is important to the health

and security of our networks and our devices.

If the network devices start to overheat,

they will shut themselves down

to protect their critical components,

and if those components get overheated for too long,

permanent damage can occur

or it can decrease the life expectancy of those devices.

Now our fifth and final thing we need to discuss

is fire suppression.

In a data center,

we usually have built-in fire suppression systems.

These can include wet pipe sprinklers,

pre-action sprinklers, and special suppression systems.

Now a wet pipe system is the most basic type

of fire suppression system,

and it involves a sprinkler system and pipes

that always contain water in those pipes.

Now in a server room or data center environment,

this is kind of dangerous

because a leak in that pipe could damage your servers

that are sitting underneath them.

In general, you should avoid using a wet pipe system

in and around your data centers.

Instead, if you're going to use a water-based system at all,

you should use a pre-action system

to minimize the risk of an accidental release.

With a pre-action system,

there's going to be a detector that works like a smoke detector,

and then a sprinkler head

that also has to be tripped

before any water is going to be released.

Again, using water in a data center,

even in a pre-action system,

is not really a good idea though, so I try to avoid it.

Instead, I like to rely on special suppression systems

for most of my data centers.

This will use something like a clean agent system.

Now a clean agent is something like a halocarbon agent

or an inert gas,

which, when released, will displace the oxygen

in the room

and essentially suffocate the fire.

Now, the danger with using

a special suppressant system like this

is that if there's people working in your data center,

those people can suffocate

if the clean agent is being released.

So your data center needs to be equipped with an alarm

that announces when the clean agent is being released,

and you also need to make sure

there's supplemental oxygen masks available

and easily accessible

by any person who's working in that data center

whenever they hear the alarm go off

for that clean agent release.

So remember, when you're designing your data centers

and your primary work environment or your recovery sites,

you need to consider your power,

your cooling, and your fire suppression needs.

Why do we need quality of service or QoS?

Well, nowadays we operate converged networks,

which means all of our networks are carrying voice, data

and video content over the same wire.

We don't have them all separated out like we used to.

We used to have networks for phones and ones for data

and ones for video,

but now everything's riding over the same IP networks.

So, because of this convergence of media,

we have these networks

that now need a high level of availability

to ensure proper delivery

of all of these different traffic types,

because we want a phone to work

every time we pick up the phone, right?

Well, by using QoS, we can optimize our network

to efficiently utilize all the bandwidth at the right time

to deliver the right service to our users

and give us better service and cost savings.

Now, we want to have an excellent quality of service,

an excellent service for our customers,

and that's what we're going to start doing by using QoS.

So what exactly is QoS?

Well, quality of service enables us

to strategically optimize our network performance

based on different types of traffic.

Previously, we talked about the fact

that we want to categorize our different traffic types.

I might have web traffic and voice traffic and video traffic

and email traffic.

And by categorizing it

and identifying these different types of traffic,

I can then prioritize that traffic and route it differently.

So I might determine how much bandwidth is required

for each of those types of traffic.

And I can efficiently use my wide area network links

and all the bandwidth available for maximum utilization,

and save on bandwidth costs over time.

This can help me identify

the types of traffic that I should drop

whenever there's going to be some kind of congestion,

because if you look at the average load,

there's always going to be some peaks and some valleys.

And so we want to be able to figure that out.

So for example, here on the screen,

you can see the peaks and the valleys

in the traffic load over time,

and we need to be able to categorize things

to fit within our bandwidth limitations.

So for example, if we have things like VoIP,

or voice over IP, or video service,

they need to have a higher priority,

because if I'm talking to you on a phone,

I don't want a high amount of latency.

For checking my bank balance, for instance, though,

I can wait another half a second for the web page to load.

But for listening to you talk, that half-second delay

starts sounding like an echo,

and it gives me a horrible service level.

So we want to be able to solve that,

and to do that, we use quality of service.

Now there are different categories of quality of service.

There are three big ones known as delay, jitter and drops.

When I talk about delay,

this happens when you look at the time

that a packet travels from the source to the destination,

this is measured in milliseconds,

and it's not a big deal if you're dealing with data traffic,

but if you're dealing with voice or video,

delay is an especially big thing,

especially if you're doing things live,

like talking on the phone or doing a live stream,

or something like that.

Now, jitter is an uneven arrival of packets,

and this is especially bad in Voice over IP traffic,

because you're using something like UDP.

And so if I say something to you, like, "my name is Jason,"

and you got "Jason my name is,"

it sounds kind of weird, right?

Now, usually it's not big chunks like that,

but instead it's little bits

and you'll hear these little clicks and glitches

that make it jumble up because of that jitter.

And this really sounds bad, and it's a bad user experience

if you're using Voice over IP.

And so jitter is a really bad thing

when you're dealing with voice and video.

Now, the third thing we have is what's known as a drop.

Drops are going to occur during network congestion.

When the network becomes too congested,

the router simply can't keep up with demand,

and the queue starts overflowing,

and it'll start dropping packets.

This is the way it deals with packet loss,

and if you're using TCP, it'll just send it again.

But again, if I'm dealing with VoIP, VoIP is usually UDP.

And so if we're talking

and all of a sudden my voice cuts out like that,

that would be bad too.

That's why we don't want to have packet drop on a VoIP call.

And so we want to make sure that that doesn't happen.

These network drops are something that can be avoided

by doing the proper quality of service as well.

So when we deal with this,

we have to think about effective bandwidth.

What is our effective bandwidth?

This is an important concept.

So let's look at this client and this server.

There's probably a lot more to this network

than what I'm showing you here on the screen,

but I've simplified it down for this example.

Here, you can see I have my client on the left,

and he wants to talk to the server.

So he goes up through the switch,

which uses 100 megabit per second Cat-5 cable.

Then he goes through a WAN link

over a 256 kilobit per second connection

because he's using an old DSL line.

Then that connects from that ISP over a T1 connection

to another router.

That router connects over an E1 connection to another router.

And from that router, it goes down a WAN link

over a 512 kilobit per second connection,

and then down to a switch with a gigabit connection,

down to the server.

Now, what is my effective bandwidth?

Well, it's 256 kilobits per second,

because no matter how fast any of the other links are,

whatever the lowest link is inside of this connection,

that is going to be your effective bandwidth.
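
If you want to see that as a quick calculation, here's a short Python sketch using the link speeds from this example, all converted to kilobits per second (the T1 and E1 values are their standard rates).

```python
# Effective bandwidth is simply the slowest link along the path.
# Link speeds below match the example, converted to kilobits per second.

links_kbps = {
    "client to switch (100 Mbps Cat 5)": 100_000,
    "WAN link (old DSL line)": 256,
    "T1 to the ISP router": 1_544,
    "E1 between routers": 2_048,
    "WAN link to the far switch": 512,
    "switch to server (1 Gbps)": 1_000_000,
}

bottleneck = min(links_kbps, key=links_kbps.get)
print(f"Effective bandwidth: {links_kbps[bottleneck]} kbps, "
      f"limited by the {bottleneck}")
```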

So we talked about quality of service categories,

in our next lesson, we're going to be talking about

how we can alleviate this problem

of this effective bandwidth, and try to get more out of it,

because we need to be able

to increase our available bandwidth, but in this example,

we're limited to 256 kilobits,

which is going to be really, really slow for us.

Now, I like to think about effective bandwidth

like water flowing through pipes.

I can have big pipes and I can have little pipes.

And if I have little pipes,

I'm going to get less water per second through it

than if I have a really big pipe.

And so this is the idea, if you think about a big funnel,

it can start to back up on us, right?

That's the concept.

And we have to figure out how we can fix that

by using quality of service effectively,

which we're going to discuss more in the next video.

When we deal with the quality of service categorization,

we first have to ask,

what is the purpose of quality of service?

Now, the purpose of quality of service is all about

categorizing your traffic and putting it into buckets

so we can apply a policy to certain buckets

based on those traffic categories

and then we can prioritize them based on that.

I like to tell stories and use analogies in my classes

to help drive home points.

And so, since we're talking about

quality of service and traffic,

I think it's important to talk about real-world traffic.

I live in the Baltimore, Washington D.C area.

This area is known for having

some really really bad traffic.

Now, to alleviate this they applied the idea

of quality of service to their traffic system.

They have three different categories of cars.

They have the first car, which is general public.

Anybody who gets in the road and starts driving,

they are part of this group.

Then there's another category

called high occupancy vehicles or HOV.

And so, if I'm driving my car

and I have at least two other passengers with me,

I can get into special HOV only lanes

and I can go a little bit faster.

Now the third bucket is toll roads or pay roads.

And you have to pay to get on these roads.

And based on the time of day

and the amount of traffic there is,

they actually increase or decrease the price.

Now, if it's during rush hour, you might pay 5 or $10

to get in one of those special toll lanes.

But, they're going to travel a whole lot faster

than the regular general commuter lanes or those HOV lanes.

Now, what does this mean in terms of quality of service?

Well, it's really the same thing.

We take our traffic and we go, okay, this is web traffic,

and this is email traffic,

and this is voice or video traffic.

And based on those buckets we assign a priority to them.

And we let certain traffic go first

and we let it get there faster.

Now, when we categorize this traffic

we start to determine our network performance based on it.

We can start figuring out the requirements

based on the different traffic types

and whether it's voice or video or data.

If we're starting to deal with voice or video

because there are things like streaming media

especially in real-time like a Skype call

or a Voice over IP service,

I want to have a very low delay

and therefore a higher priority.

This way I can do this stuff

for streaming media and voice services

and prevent those jitters and drops and things like that

that we talked about before.

Now, this is something that I want to make sure

has a good high priority so I can get it through.

Instead if I have something with a low priority

that might be something like web browsing

or non-mission critical data.

For instance, if my employees are surfing on Facebook,

that would be a very low priority.

Or if I deal with email,

email is generally a low priority

when it comes to quality of service.

Now why is that, isn't email important to you?

Well, because most email is done

as a store and forward communication method.

This means when I send email,

it can sit on my server for 5 or 10 minutes

before it's actually sent out to the end-user

and they'll never realize it.

So that's okay.

It can be a low priority, it'll still get there eventually.

But if I did the same thing with VoIP traffic,

even delaying it by half a second or a second,

you're going to hear jitters and bumps and echoes

and that would be a horrible service.

So, we want to make sure you get high quality of service

for VoIP and lower priority for email.

Now that's just the way we have it set up.

You can have it set up however you want

as long as you understand

what your quality of service policy is,

and you understand it, and your users understand it,

this is going to be okay.

The best way to do that is to document it

and share that with your users.

You want to make sure your users understand your policy

because this will help make sure

that they don't think there's a problem

and start reporting it back to your service desk.

You can do this by posting it to your internal website.

You might post it as part of your indoctrination paperwork

or whatever method you want.

You want to make sure those users understand it

because they're the ones who are going to be there

surfing Facebook or watching YouTube.

If you've categorized it as a low priority,

they're going to think something's broken.

But if they know it's a low priority,

they understand it's not broken

it's just your corporate policy.

Now, if they're going to be surfing

something on the web that's mission critical,

that's a higher priority and it's going to get

preferential treatment with your quality of service,

they should know that too.

This is the idea here.

We have to make sure that they understand

how we categorize our traffic

and what categories those get put into.

Now, what are some ways that we can categorize our traffic?

Well, there's really three different mechanisms you can use.

We have best effort, integrated services,

and differentiated services.

Now, when we use best effort

this is when we don't have any quality of service at all

and so traffic is just first in, first out,

every man for himself.

We're going to do our best and just try to get it there.

There's really no reordering of packets.

There's no shaping.

It's just pretty much no quality of service.

First in, first out, best effort.

The second type is known as integrated services or IntServ.

This is also known as hard QoS.

There are different names for it

depending on what company you're using

and what routers and switches you're using.

But the idea here is,

we're going to make strict bandwidth reservations.

We might say that all web traffic

is going to get 50% of our bandwidth,

VoIP service is going to get 25%,

and video service is going to get the remaining 25%.

Now, by reserving bandwidth

for each of these types of service,

we now decide how much is going to be there

for each of those three categories.

Now, when we do a DiffServ or differentiated services,

also known as soft QoS,

those percentages become more of a suggestion.

There's going to be this differentiation

between different data types

but for each of these packets,

it's going to be marked its own way.

The routers and switches can then make decisions

based on those markings

and they can fluctuate traffic a little bit as they need to.

Now, this is referred to as soft QoS

because even though we set web up as maybe 50%,

if there's not as much web browsing going on right now,

we can actually take away some of that 50%

and give it over to VoIP and increase that from 25 to 35%.

This way, when somebody wants to browse the web,

we can then take back that extra from VoIP

and give it back to web, back to that 50% it originally had,

based on those markings and based on those categories.

Now, if we were using hard QoS or that integrated services,

even if we allocate 50% for web browsing

and nobody's using web browsing,

we're still going to have 50% sitting there

waiting to serve people for web browsing.

And that's why a lot of companies prefer to use soft QoS.
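
Here's a small Python sketch of that soft QoS idea, with the 50/25/25 split from the example; the reallocation rule is simplified just to illustrate the concept, and it's not how any particular router implements DiffServ.

```python
# Sketch of hard QoS (fixed reservations) vs. soft QoS (reallocation).
# Percentages follow the lesson's example; the rule itself is simplified.

reservations = {"web": 50, "voip": 25, "video": 25}   # hard QoS: fixed, even if idle

def soft_qos_share(reservations: dict, demand: dict) -> dict:
    """Hand any bandwidth an idle class isn't using over to the busy classes."""
    unused = sum(pct for cls, pct in reservations.items() if demand.get(cls, 0) == 0)
    busy = [cls for cls in reservations if demand.get(cls, 0) > 0]
    share = dict(reservations)
    for cls in reservations:
        if demand.get(cls, 0) == 0:
            share[cls] = 0
    for cls in busy:
        share[cls] += unused / len(busy)
    return share

# Right now nobody is browsing the web, but VoIP and video are busy:
print(soft_qos_share(reservations, {"web": 0, "voip": 1, "video": 1}))
# -> web drops to 0 for now, and VoIP and video split its unused 50%
```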

Now, let's take a look at it like this

because I like to use simple charts and graphs

to try to make it easy to understand.

With best effort at the top,

you have no strict policies at all.

And basically, you just make your best effort

at providing everyone a good quality of service.

Now with DiffServ you have less strict policies,

also known as soft QoS.

Now it's better than the best effort approach

but it's still not the most efficient

or effective method of providing a good quality of service

to those who really need it.

Now with IntServ approaches

you're going to have more of a hard QoS limit.

This is what we've talked about before.

Now, this is going to give you the highest level of service

to those within strict policies.

And if you need a really strong quality of service level

then IntServ or hard QoS with its strict policies

can really ensure that you get it.

Now, the way I like to look at this

is as bundles of QoS options that we can choose from.

So which of these bundles is really the best?

Well, it depends.

It depends on your network and it depends on your needs.

But most of the time, it's not going to be a best effort

because that's usually going to give you

not as much quality as you're really going to want here.

Now, when we start categorizing our traffic out there

we're going to start using these different mechanisms,

either soft or hard QoS, for doing that.

And we can do that using classification and marking.

We can do it through congestion management

and congestion avoidance.

We can use policing and shaping.

And we can also use link efficiency.

All of these choices fall under a soft QoS or hard QoS

depending on your configuration that you've set up

within your network appliances, firewalls, or routers.

As I mentioned before,

we have different ways of categorizing our traffic.

We can do it through classification, marking,

utilizing congestion management, congestion avoidance,

policing and shaping, and link efficiency.

All of these are ways for us to help implement

our quality of service and take us from this to this.

Now, as you can see,

we want to start shaping out those peaks and valleys

using these different mechanisms

to give us a better quality of service.

Now, when we look at the classification of traffic,

traffic is going to be placed

into these different categories.

Now, this is going to be done

based on the type of traffic that it is.

There's email, but even inside of email,

we have many different classes

of information inside of an email.

If you think about email,

we have POP3 traffic, we have IMAP traffic.

We have SMTP traffic. We have Exchange traffic.

Those are four different types right there.

And so we can look at the headers

and we can look at the packet type of information

and we can even use the ports that are being used.

And then we can determine what services

need higher or less priority.

We can then do this, not just across email,

but across all of our traffic.

And by doing this, this classification

doesn't alter any bits in the frame itself or the packet.

Instead, there is no marking inside of there.

It's all based on the analysis of the packet itself,

the ports and the protocols used,

and our switches and routers are going to implement QoS

based on that information.

Now, another way to do this, is by marking that traffic.

With this, we're going to alter the bits within the frame.

Now we can do this inside frames, cells, or packets,

depending on what networks we're using.

And this will indicate how we handle this piece of traffic.

Our network tools are going to make decisions

based on those markings.

If you look at the type of service header,

it's going to have a byte of information or eight bits.

The first three of those eight bits are the IP Precedence,

and the first six bits make up

the Differentiated Services Code Point, or DSCP.

Now you don't need to memorize

how this type of service is done inside the header.

But I do want you to remember one of the ways

that we can do this quality service

is by marking and altering that traffic.
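
As one concrete example of marking, here's a short Python sketch that sets the DSCP bits on outbound packets from an application, assuming a Linux-style socket that honors the IP_TOS option; the address and port are placeholders.

```python
# Sketch: marking outbound traffic with a DSCP value from an application,
# assuming a Linux-style socket that honors the IP_TOS option.
import socket

DSCP_EF = 46                 # Expedited Forwarding, commonly used for voice
tos_byte = DSCP_EF << 2      # DSCP sits in the top 6 bits of the old ToS byte

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, tos_byte)

# Datagrams sent on this socket now carry the EF marking, so DiffServ-aware
# routers and switches can give them priority treatment.
sock.sendto(b"voice sample", ("192.0.2.10", 5004))   # placeholder address/port
```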

Next, we have congestion management.

And when a device receives traffic

faster than it can be transmitted,

it's going to end up buffering that extra traffic

until bandwidth becomes available.

This is known as queuing.

The queuing algorithm is going to empty the packets

in a specified sequence and amount.

These algorithms are going to use one of three mechanisms.

There is a weighted fair queuing.

There's a low-latency queuing,

or there is a weighted round-robin.

Now let's look at this example I have here.

I have four categories of traffic:

Traffic 1, 2, 3, and 4.

It really doesn't matter what kind of traffic it is,

for our example right now,

we just need to know that there's four categories.

Now, if we're going to be using a weighted fair queuing,

how are we going to start emptying these piles of traffic?

Well, I'm going to take one from 1, one from 2,

one from 3, and one from 4.

Then I'm going to go back to 1 and 2 and 3 and 4.

And we'll just keep taking turns.

Now, is that a good mechanism?

Well, maybe. It depends on what your traffic is.

If column 1, for example, was representing VoIP traffic,

this actually isn't a very good mechanism,

because it makes VoIP keep waiting for its turn.

So instead, let's look at this low-latency queuing instead.

Based on our categories of 1, 2, 3, and 4,

we're going to assign priorities to them.

If 1 was a higher priority than 2,

then all of 1 would get emptied,

then all of 2 would get emptied,

and then all of 3, and then all of 4.

Now this works well to prioritize things like

voice and video.

But if you're sitting in category 3 or 4,

you might start really receiving

a lot of timeouts and dropped packets

because it's never going to be your turn.

And you're just going to wait and wait and wait.

Now the next one we have is called the weighted round-robin.

And this is actually one of my favorites.

This is kind of a hybrid between the other two.

Now with a weighted round-robin,

we might say that category 1 is VoIP,

and category 2 is video, category 3 is web,

and category 4 is email.

And so we might say that in the priority order,

1 is going to be highest

and we're going to use a weighted round-robin,

and we might say, we're going to take three

out of category 1, two out of category 2,

and then one out of 3 and one out of 4.

And we'll keep going around that way.

We'll take three, two, one, one, three, two, one, one.

And we keep going.

That way, VoIP traffic is getting a lot of priority.

Video is getting the second highest priority.

And then we start looking at web and email

at the bottom of the barrel,

but they're still getting a turn

every couple of rounds here.

And so that way it becomes a weighted round-robin.

As I said, this is the quality of service mechanism

that I really like to implement inside my own networks.
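
Here's a minimal Python sketch of that weighted round-robin idea, using the 3-2-1-1 weights from the example; real routers do this in hardware, so treat this purely as an illustration.

```python
# Sketch of a weighted round-robin scheduler like the one described above.
# Queue names and weights (3, 2, 1, 1) follow the lesson's example.
from collections import deque

queues = {
    "voip":  deque(f"voip-{i}"  for i in range(6)),
    "video": deque(f"video-{i}" for i in range(6)),
    "web":   deque(f"web-{i}"   for i in range(6)),
    "email": deque(f"email-{i}" for i in range(6)),
}
weights = {"voip": 3, "video": 2, "web": 1, "email": 1}

def weighted_round_robin(queues, weights):
    """Yield packets, taking up to 'weight' packets from each queue per round."""
    while any(queues.values()):
        for name, weight in weights.items():
            for _ in range(weight):
                if queues[name]:
                    yield queues[name].popleft()

print(list(weighted_round_robin(queues, weights))[:14])
```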

Next, we have the idea of congestion avoidance.

As new packets keep arriving, they can be discarded

if the output queue is already filled up.

Now, I like to think about this as a bucket.

As you can see here, I have a cylinder on the bottom

and it has a minimum and a maximum.

Now, if it's already at maximum and you try

to put more into the bucket,

it just overflows over the top.

Now to help prevent this, we have what's called

the RED or random early detection.

This is used to prevent this overflow from happening for us.

As the queue starts approaching that maximum,

we have this possibility

that a discard is going to happen.

And so what we start doing is dropping traffic.

Instead of just dropping traffic randomly,

we're going to drop it based on priority,

with the lowest traffic priority getting dropped first.

RED is going to drop packets from the selected queues

based on their defined limits.

Now I might start dropping TCP traffic first

because I know it'll retransmit itself.

Whereas with UDP, if you drop it, it's gone forever.

And so I might keep that in my queue a little bit longer,

so it doesn't get dropped.

Now, that's the idea here with TCP traffic,

even if I drop it, we're going to get that retransmission

and we'll try again.

But with UDP, if it dropped,

you're never going to know about it,

and you're going to have loss of service.

Now, when you're dealing with congestion avoidance,

we're going to try to use the buffer

to our advantage, and be able to use it to help us

get more bandwidth through.
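
Here's a simplified Python sketch of the random early detection idea; the thresholds and the prefer-to-drop-TCP-first rule are illustrative assumptions, not the exact RED algorithm any particular router uses.

```python
# Sketch of the random early detection (RED) idea: as the queue fills between
# a minimum and a maximum threshold, the chance of dropping a new packet rises.
# Thresholds and the TCP-first preference are illustrative.
import random

MIN_THRESHOLD = 20      # below this queue depth, never drop
MAX_THRESHOLD = 50      # at or above this depth, always drop

def should_drop(queue_depth: int, protocol: str) -> bool:
    if queue_depth < MIN_THRESHOLD:
        return False
    if queue_depth >= MAX_THRESHOLD:
        return True
    # Drop probability ramps up as the queue approaches the maximum...
    drop_probability = (queue_depth - MIN_THRESHOLD) / (MAX_THRESHOLD - MIN_THRESHOLD)
    # ...and we prefer to drop TCP first, since it will retransmit on its own,
    # while dropped UDP (like VoIP) is gone for good.
    if protocol == "udp":
        drop_probability *= 0.5
    return random.random() < drop_probability

print(should_drop(10, "tcp"), should_drop(45, "tcp"), should_drop(60, "udp"))
```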

Now, when we start putting all these things together,

we start getting into these two concepts,

known as policing and shaping.

Policing is going to discard packets

that exceed the configured rate limit,

which we like to refer to as our speed limit.

Just like if you're driving down the highway too fast,

you're going to get pulled over by a cop

and you're going to get a ticket.

That's what policing is going to do for us.

Now, we're just going to go and drop you off the network

anytime you're going too fast.

So, dropped packets are going to result in retransmissions,

which then consume even more bandwidth.

Therefore, policing is only good

for very high-speed interfaces.

If you're using a dial up modem or an ISDN connection,

or even a T1, you probably don't want to use policing.

You're much better off using our second method,

which is known as shaping.

Now, what shaping is going to do for us

is it's going to allow the buffer

to delay traffic from exceeding the configured rate.

Instead of dropping those packets like we did in policing,

we're just going to hold them in our buffer.

Then when it's empty and there's space available,

we're going to start pushing it

into that empty space and start shaping out the packets.

This is why we call it shaping or packet shaping.

Now you can see what this looks like here on the screen.

I have traffic at the top,

and you'll see all those jagged lines going down.

Now, what really happens here in your network

is there's this high period of time,

and there's low periods of time,

because not everything is happening

all the time in an equal amount.

If we do policing, all we did was chop off the tops,

which gave us more retransmissions.

With shaping, instead, we're going to start filling

in from the bottom, from our queue.

So it keeps up there right towards the speed limit

without going over it.

Again, shaping does a better job

of maximizing your bandwidth,

especially on slow speed interfaces,

like a T1 connection, a dial up,

satellite connections, or ISDN.
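
Here's a small Python sketch contrasting the two behaviors, with made-up packet counts: policing discards anything over the rate, while shaping buffers the excess and sends it later.

```python
# Sketch contrasting policing (drop anything over the rate limit) with shaping
# (hold the excess in a buffer and send it later). Numbers are illustrative.

RATE_LIMIT = 5                      # packets we can send per interval
arrivals = [8, 2, 7, 1, 3, 9]       # packets arriving in each interval

def police(arrivals, rate):
    sent, dropped = [], 0
    for burst in arrivals:
        sent.append(min(burst, rate))
        dropped += max(0, burst - rate)   # excess is simply discarded
    return sent, dropped

def shape(arrivals, rate):
    sent, buffered = [], 0
    for burst in arrivals:
        buffered += burst
        out = min(buffered, rate)         # excess waits in the buffer instead
        sent.append(out)
        buffered -= out
    return sent, buffered

print("policing:", police(arrivals, RATE_LIMIT))   # smooth, but packets lost
print("shaping: ", shape(arrivals, RATE_LIMIT))    # smooth, packets just delayed
```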

Then the last thing we need to talk about here

is link efficiency.

Now there's a couple of things we need to mention

in regard to link efficiency.

The first of which is compression.

To get the most out of your link,

you want to make it the most efficient possible.

And so to do that, we can compress our packets.

If we take our payloads and compress them down,

that's going to conserve bandwidth

because there are fewer ones and zeros

that need to go across the wire.

VoIP is a great thing that you can compress

because there's so much extra space

that's wasted inside of voice traffic.

VoIP payloads can actually be reduced

by up to 50% of their original space.

We could take it down from 40 bytes

down to 20 bytes by using compression.

If you think that's good, look at the VoIP header.

I can compress the VoIP header down

by 90 to 95% of its original size.

I can take it from 40 bytes down to just two to four bytes.

To do this, we use something called compressed RTP or cRTP.

Now, when I have the original VoIP payload,

as you can see here, I have an IP address,

I have UDP as my packet type,

and I have RTP for its header.

And then I have my voice payload.

I can compress all of that down into just a cRTP,

which consolidates the IP, the UDP,

and the RTP altogether into one.

The voice payload can also be squeezed down

to about half of its size.

Now you're not going to notice a big difference

in your audio quality by doing this,

and it can be utilized on slower speed links

to make the most of your limited bandwidth.
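
Here's the back-of-the-envelope math behind those cRTP numbers as a quick Python sketch; the byte counts are the approximate figures from this lesson.

```python
# Back-of-the-envelope math for the cRTP savings described above.
# Header and payload sizes are the approximate figures from the lesson.

ip, udp, rtp = 20, 8, 12             # uncompressed header bytes (roughly 40 total)
voice_payload = 40                   # typical small voice sample

original = ip + udp + rtp + voice_payload
compressed = 4 + voice_payload // 2  # cRTP header of ~2-4 bytes, payload roughly halved

print(f"original packet:   {original} bytes")     # 80 bytes
print(f"compressed packet: {compressed} bytes")   # about 24 bytes
print(f"savings: {100 * (original - compressed) / original:.0f}%")
```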

And it's not just for VoIP.

You can do this with other types of data too.

Compression is a great thing to use.

There are devices out there called WAN accelerators

that focus specifically on compressing your data

before sending it out over your WAN link.

The last thing I want to talk about here

is what we call LFI, which is another method

to make more efficient use of your links.

This is known as link fragmentation and interleaving.

Now what this does is if you have a really big packet,

it'll start chopping those up

and take those big packets and fragment them,

and then interleave smaller packets in between them.

This way, it's going to allow you to utilize

those slower speed links to make the most

of your limited bandwidth.

Notice here I have three voice packets,

and one big chunk of data.

Now what the router would do is chop up that data

and send one small voice piece

and then one small data piece,

and then one small voice piece,

and one small data piece.

That way, the voice doesn't suffer

from huge latency by waiting for that big piece

of data to go through first.

By doing this fragmentation and interleaving,

it allows you to get some of that high priority traffic out

in between those larger data structures as well.
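
And here's a last little Python sketch of that fragmentation and interleaving idea, with made-up packet sizes, just to show the ordering that LFI produces on the wire.

```python
# Sketch of link fragmentation and interleaving (LFI): chop a large data
# packet into fragments and slot small voice packets in between them.

def fragment(packet: str, size: int) -> list[str]:
    """Split one big packet into fixed-size fragments."""
    return [packet[i:i + size] for i in range(0, len(packet), size)]

def interleave(voice_packets: list[str], data_fragments: list[str]) -> list[str]:
    """Alternate voice packets and data fragments on the outbound link."""
    order = []
    for i in range(max(len(voice_packets), len(data_fragments))):
        if i < len(voice_packets):
            order.append(voice_packets[i])
        if i < len(data_fragments):
            order.append(data_fragments[i])
    return order

voice = ["voice-1", "voice-2", "voice-3"]
big_data = "D" * 12                  # one big data packet
print(interleave(voice, fragment(big_data, 4)))
# -> ['voice-1', 'DDDD', 'voice-2', 'DDDD', 'voice-3', 'DDDD']
```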