Keynote at NANOG 77 by CTO Karl Auerbach

IWL Chief Technical Officer Karl Auerbach delivered the Keynote speech at NANOG 77 in Austin, Texas on Tuesday, October 29, 2019.

Network Operations On A Public Utility Internet

Transcript:

--SHOWING: Opening title--

Good morning.

My name is Karl Auerbach.

I am CTO at IWL, InterWorking Labs.

We are a company that builds tools to help developers exercise their stuff under less than perfect network conditions.

I've seen a lot of gear crumble when subjected to conditions that are unusual, but still RFC compliant.

That's OK in a developer's test lab.

But it happens too often in production.

There is no perfect network code.

Even the Linux IP stack can be brought to its knees with perverse IP fragmentation.

We spent a couple of years working with DARPA on rescue robots.

I used our gear to introduce the kinds of network problems one might have during a disaster situation.

So, think of us next time a Terminator-looking robot carries you out of a burning reactor building.

I first encountered the net when I was an undergrad at UCLA during the ARPAnet days.

I had a student job running a big grey IBM 7094.

This was a machine that cooled its memory with oil.

It had a dipstick.

And, like all memory, it sometimes leaked.

Real puddles.

Really slippery.

Our daily operations included a bag of kitty litter and a broom.

I got into networking by proximity:

That leaky 7094 was in the room next to ARPANET IMP #1.

Around 1972 I took a job at System Development Corporation.

That's the "SDC" in the lower left of the map.

--SHOWING: Q7A Computer Console Room--

SDC was formed to write the code for the 1950's SAGE early warning radar computers.

The Q7 shown in this slide was among the most powerful computers in existence.

It was physically large:

Each machine filled a three story building.

The Q7 used 80,000 vacuum tubes.

A colleague walked me through the building that once held the Q7.

He said things like "this room held the adder, that room held the memory."

Most of the code was designed by and written by women.

The successor to the Q7 was the Q32.

Only one Q32 was ever built.

And it, too, was located at SDC in Santa Monica.

Ballistic missiles were replacing bombers.

The SAGE system was becoming obsolete.

The Q32 was never put into service.

That left a huge computer without many nearby users.

So the government began to think about ways to share the Q32.

That became one of the streams that led to the ARPAnet.

During my time at SDC I worked on secure operating systems and networks.

Much of what we did was classified.

Our work was invisible to the public.

We were probably the first to come up with several ideas that were re-invented later, such as VLANs and key management systems.

My particular speciality was capability based machines and operating systems.

But I also did a lot of protocol design.

We split the then monolithic TCP protocol into two parts:

These went on to become IP and TCP.

But we went further and inserted a new security layer in the middle.

Think IPSEC on steroids.

Despite this auspicious beginning I began to live a double life.

Curiosity and a quite interesting weekend in San Francisco induced me to go to law school.

So by day, I live the wild and free life of a network techie.

By night, I'm a lawyer.

When I look at the internet I see not only a bundle of interesting technologies but also a collection of duties and responsibilities.

Today I want to reflect on the uncomfortable notion that as the internet becomes ever more deeply enmeshed into our lives, new obligations and constraints are going to land on network operators.

--SHOWING: Battleship Main News Headline--

Every week seems to bring a new proposal to cure some internet problem.

Two years ago it was net neutrality, last year it was the GDPR.

This week it is encryption back doors, again.

To those of us in the tech community these proposals can look like a dentist with a chain saw.

And these proposals are often mis-directed.

For instance, regulators don't seem to understand that inspection and filtering of IP packets isn't a very good way to cure social media abuse.

On the other hand there are issues that do fall squarely into the bailiwick of the internet operator community.

People need timely, reliable, and trustworthy carriage of IP packets.

That's your job and your responsibility.

That responsibility may extend to ancillary services, such as DNS.

The internet is being merged with our utility and financial systems.

The internet has become a critical infrastructure.

It is not surprising that our governments are beginning to consider regulation of those who provide that infrastructure.

--SHOWING: Locomotive hanging over edge--

We all know about Mr. Murphy and his law: that the worst things happen at the worst times.

When it comes to the Internet, Mr. Murphy has lots of help.

These range from the merely inept, to script kiddies, to security organizations run by hostile governments.

There are vendors who ship buggy, often under-tested products, or products with suspected Trojans.

And network operations can be complicated -- there's lots of room for error.

Users and businesses are increasingly entrusting their health, safety, private affairs, and finances to the net.

Users are depending on us.

It's annoying when we ask Alexa to turn off a light and she says "not now."

It is far more troublesome when an airline has to cancel flights because of a network problem.

And there's a YouTube video showing a doctor using the net to perform remote-control brain surgery at a distance of 3000 kilometers.

(https://youtu.be/qiZrWx51zp0)

We here understand that route flaps and errant backhoes do happen.

That doctor probably does not.

--SHOWING: PG&E truck in Paradise fire--

You and I know that the net is imperfect.

But our opinions don't really count for much.

The public is starting to perceive the net as a critical utility.

And it's their opinion that really matters.

When Mr. Murphy's hammer comes down on the anvil of user expectations who do you think is going to get clobbered?

Those of you in this room -- the operators.

Who is going to get the blame when things don't work?

You, the operators.

--SHOWING: Boss Tweed pointing to others--

Getting blamed will be expensive.

So there will arise a secondary game:

Some Other Dude Did It.

The goal of this game is to shift the blame and expense to someone else.

This game won't be pretty.

It won't be fun.

It won't be voluntary.

Think of it as a kind of Game of Thrones.

But with lawyers instead of dragons.

--SHOWING: Old Bailey--

The rest of this talk is about how operators can play this game fairly, honestly, and with minimal cost and pain.

I'm going to talk a bit about legal concepts.

Some of this may seem foreign.

Despite a lot of evidence to the contrary, the law is not irrational. It is based on hundreds of years of hard experience with actual disputes.

I'm only going to talk about civil law.

I think it is safe to say that nobody here intends to enter the criminal realm.

--SHOWING: Chaos in the courtroom--

A basic principle of civil law is that if Alice unlawfully harms Bob then Alice must compensate Bob.

Compensation is usually in the form of money damages.

But, as this slide shows, compensation can take other forms.

The basic idea is simple.

But there are a lot of details.

And even the details have details.

It is too much to completely cover in this short talk.

So I'll just touch on a couple of aspects.

--SHOWING: Laurel & Hardy, Swiss Miss--

The most important thing for you to know is this:

Liability is neither absolute nor automatic;

there are a lot of conditions.

For our purposes the most important issue is whether you run your shop with sufficient care.

What is "sufficient care"?

There is no simple answer.

But one can sketch a few boundary lines:

The first line is "don't do stupid things."

A better line is "do things at least as well as the best of your competitors."

By the way, there is a concept called "strict liability."

It is as scary as it sounds.

It is a blunt legal weapon that throws out the excuses.

Under this doctrine, if you cause harm then you pay.

Fortunately for us, this doctrine usually triggers only when there is a chance of severe physical human harm.

But let's not get overconfident;

We may already be walking on thin ice:

Don't forget that video about doing brain surgery over the net.

And California is now applying strict liability to electrical utilities that start fires.

Enough about strict liability, let's move on to the more common situation:

--SHOWING: Leaning Tower of Pisa--

We all make dumb mistakes.

And most situations are not clear cut.

I can't promise that they will make you rich and famous, but good operational practices and record keeping will reduce ambiguity and make your life easier.

So what should you do?

From what I've observed on the NANOG mailing list and at meetings, you are probably doing things pretty well already.

However, you are almost certainly going to have to do more annoying paperwork and formalized record keeping.

Oral communication not backed by paper will cause trouble.

Anything done not in accord with a pre-adopted procedure will cause trouble.

Ad hoc anything will cause trouble.

You want to make written records as much a part of your regular operations as possible.

Otherwise you risk falling into some of the law's biggest tar pits, such as the hearsay rule.

--SHOWING: Obvious Things--

Let me provide a few suggestions.

I'm going to skip through these pretty fast.

Don't try to read them here.

You can see these lists in more detail in the offline materials.

On this slide I begin with things that are fairly obvious.

You are probably doing these already.

The key point here is that as much as possible you should be using pre-defined broadly practiced procedures.

  • Use your trouble ticket system -- and be verbose.

  • Don't do anything that can't be tracked back to a ticket.

  • Have written procedures and follow them.

  • Use checklists to make sure that procedures are followed.

  • Work with others in the industry to adopt design rules and best practices.

--SHOWING: Less Obvious Things--

Here are some things that I suspect some of you may not be doing.

The key ideas here are:

  • Be qualified

  • Be suspicious

  • Be prepared

I want to highlight the last two items on the slide, insurance and performance baselines.

You need to have insurance.

Insurance is your backstop.

But you need to make sure that it is up to date and actually covers what you think it covers.

Baselines are your early warning system.

You need to know how your net behaves under normal situations.

Then you can watch for deviations from those baselines.

By the way, don't just look at first order deviations.

Look for inflections in the rate of deviation.

Look at second derivatives.

  • Have, and use, a triage procedure to assign priorities. Use it even when you are not overloaded.

  • Make sure that people who do tasks are properly qualified. Consider periodic re-validation of skills and yearly refresher classes.

  • Keep before and after records of configuration changes.

  • Don't trust what vendors tell you -- verify.

  • Test equipment and software before upgrades.

  • Have a sandbox for that testing.

  • Think about what could fail; have contingency plans.

  • Have insurance.

  • Have baseline measures of your traffic flows and routing patterns: Generate warnings when there are sudden changes that have not been seen previously.
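
As a rough illustration of the baseline advice above, including the suggestion to look at second derivatives, here is a minimal Python sketch. It is not from the talk: the function name `flag_deviations`, the three-sigma threshold, and the idea of feeding it evenly spaced traffic samples are all assumptions made for the example.

```python
from statistics import mean, stdev

def diff(values):
    """Successive differences: result[i] = values[i + 1] - values[i]."""
    return [b - a for a, b in zip(values, values[1:])]

def flag_deviations(samples, baseline_len, threshold=3.0):
    """Flag post-baseline samples that depart from baseline behavior.

    Three series are checked: the raw level, the rate of change (first
    difference), and the change in that rate (second difference). The
    first `baseline_len` samples define "normal" for each series.

    Returns a list of (sample_index, kind) tuples.
    """
    if baseline_len < 4:
        raise ValueError("need at least 4 baseline samples")
    first = diff(samples)
    second = diff(first)
    flagged = []
    # offset = how many samples newer than index i a derived value depends on
    for kind, series, offset in (
        ("level", samples, 0),
        ("rate", first, 1),
        ("acceleration", second, 2),
    ):
        split = baseline_len - offset
        base = series[:split]
        mu = mean(base)
        # Guard against zero spread in a perfectly steady baseline.
        sigma = stdev(base) or 1e-9
        for i in range(split, len(series)):
            if abs(series[i] - mu) > threshold * sigma:
                flagged.append((i + offset, kind))
    return flagged

# Hypothetical samples: steady growth, then growth that starts to speed up.
samples = list(range(100, 110)) + [110, 112, 115]
for index, kind in flag_deviations(samples, baseline_len=10):
    print(f"sample {index}: unusual {kind}")
```

The point of checking all three series is that a change in the rate can show up while the level itself still looks normal, which is the early warning the baseline is meant to provide.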

--SHOWING: Things To Consider--

And here are things you probably are not doing.

This is a long list -- I'll touch on just a couple of points.

Your records may be your best defense or your worst weakness.

If you don't have regularly practiced systems the other side can -- and probably will -- use that against you.

You must have -- and use -- a well defined document destruction schedule.

The networking community has a long history of people working with one another. That's a good thing.

But we have to be careful to avoid any restraint of trade landmines. Keep your lawyers in the loop when cooperating with competitors.

You don't want to put your own personal assets at risk.

You do not want to hear the awful phrase "your corporate veil has been penetrated."

So adhere to corporate formalities.

You want to have a corporate secretary who likes checklists and who is hard nosed about following seemingly silly procedures.

  • Keep archival copies of everything.

  • Have a regular schedule for purging old materials and archives. Religiously follow it.

  • Avoid ad hoc procedures. If you encounter a new situation create a written procedure for it before resolving it even though every neuron in your body is screaming "fix it now."

  • Don't invent new procedures for yourself unless yours are clearly and demonstrably better. And when you do invent be sure to document your reasoning. Articulate with clarity why you are deviating from the norm. Your reasons should be compelling.

  • Equipment and service agreements should push some responsibility back onto the vendors.

  • Remember that you may have obligations to notify the public or regulatory bodies.

  • Before you work with competitors be sure to get the advice of an anti-trust lawyer.

  • Make sure your corporate structure is strong -- corporate formalities are important to keep liability from leaking onto management, shareholders, or owners.

  • Develop databases of known network pathologies so that when you see symptoms you can quickly look up what that might indicate and what diagnostic tests are in order.

Yeah, sometimes there are emergencies that require immediate and drastic measures.

Alex Latzko and I once had to resort to percussive maintenance.

We hammered the beejeebers out of an equipment cabinet because its cable passages were too small.

It was necessary.

But there was also no doubt that our organization was obligated to cover the costs.

Most risk to operators is going to arrive under the banners of "recklessness" and "negligence."

--SHOWING: Chernobyl--

By now you've all seen the HBO series "Chernobyl."

That was reckless.

It is usually pretty easy to avoid being reckless.

Do not disregard known consequences deliberately and without justification.

Don't disable the interface to a hospital surgery center without calling them first.

--SHOWING: Banana peel--

The legal doctrine of negligence will pose the greatest risk to operators.

So what is "negligence"?

That's a hard question to answer.

It is not simple.

In addition, negligence is a moving standard. It changes with time and context.

Let me dispel the most common fear:

You are allowed to make mistakes.

A core idea in negligence is the notion of a standard of care.

This is a standard largely established by the behavior of your peers.

In our case, of other operators and providers.

If everyone else has backup power and you don't -- then you may be falling below the required standard of care.

If everyone else does sandboxed tests of updates and you don't -- then you may be asking for trouble.

But this standard of care evolves.

A procedure that was good enough last year may be inadequate next year.

There is no bright line that marks what is good enough.

The standard is an imagining of what a hypothetical "reasonable person" would do.

Look around the room.

Consider what your peers may conclude are the minimal acceptable practices of today.

That's the starting point for figuring out today's standard of care.

Strive to exceed it.

Fail at your peril.

Moreover, as the net is increasingly perceived as a critical utility, other forces will push that standard of care even higher.

Over time the industry as a whole will be held to ever higher standards.

You will have to keep up.

You should build periodic reviews of this question into your management plans.

Use your law firm as an outpost to warn you of significant legal cases or changes in the law.

Charter a working group here at NANOG to watch for important events.

I've only begun to scratch the surface of the complexity of living in the liability fast lane.

--SHOWING: Goldberg Napkin--

Things can get much more complicated.

In passing I'll glide over one complication, proximate cause.

I won't go into details even though there are lots of intriguing examples.

There's a famous case about a woman injured by a clock knocked over by a bomb accidentally dropped by an anarchist as he leapt onto a moving train.

And another case in which a city flooded as the result of river ice knocking free a pair of badly anchored ships that were then trapped by a drawbridge that was raised too slowly.

Sometimes your mistake won't affect anyone until events have passed through several downstream intermediaries.

Sometimes you will be one of those intermediaries.

For instance, if you propagate a bogus route that you received.

Don't expect the presence of intermediaries who could have mitigated your error to absolve you.

Nor are you excused if you are an intermediary and your propagation could be considered negligent.

This is a prodigiously deep swamp.

Your best strategy is to be very skeptical of stuff you get from your upstreams.

Given recent events we can anticipate a lot of pressure to become more careful about accepting paths or prefixes.

You might have to become proactive.

For instance, we may need better heuristics for noticing when things are starting to wobble.

We may need to raise a "somebody take a look" alarm when there is an unusual traffic shift.
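
To make the idea of being skeptical about received paths and prefixes a little more concrete, here is a small Python sketch of a "somebody take a look" check on an incoming IPv4 route. It is only an illustration: the table `EXPECTED_ORIGINS`, the function `route_concerns`, the short bogon list, and the /24 length limit are assumptions for this example, and the prefixes and AS numbers are documentation values standing in for real ones. A real deployment would more likely draw on IRR or RPKI data.

```python
import ipaddress

# A deliberately short, illustrative list of ranges that should never
# appear in routes received from a neighbor.
BOGONS = [
    ipaddress.IPv4Network(n)
    for n in ("10.0.0.0/8", "127.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]

# Hypothetical baseline: prefixes we expect to see and the origin AS
# numbers we consider legitimate for each of them.
EXPECTED_ORIGINS = {
    ipaddress.IPv4Network("192.0.2.0/24"): {64500},
    ipaddress.IPv4Network("198.51.100.0/24"): {64501},
}

MAX_PREFIX_LEN = 24  # assumed policy: anything longer is treated as suspect

def route_concerns(prefix, origin_as):
    """Return a list of reasons to look twice at a received route.

    An empty list means none of the checks in this sketch objected.
    """
    net = ipaddress.IPv4Network(prefix)
    concerns = []
    if any(net.subnet_of(bogon) for bogon in BOGONS):
        concerns.append("bogon")
    if net.prefixlen > MAX_PREFIX_LEN:
        concerns.append("too-specific")
    # Collect the allowed origins of every baseline prefix that covers this one.
    covering = [
        origins
        for known, origins in EXPECTED_ORIGINS.items()
        if net.subnet_of(known)
    ]
    if not covering:
        concerns.append("unknown-prefix")
    elif not any(origin_as in origins for origins in covering):
        concerns.append("unexpected-origin")
    return concerns

for prefix, asn in [("192.0.2.0/24", 64500), ("192.0.2.0/24", 64999)]:
    print(prefix, asn, route_concerns(prefix, asn) or "ok")
```

The sketch returns reasons rather than a yes or no so that each decision can be logged against a ticket, in keeping with the record-keeping advice earlier in the talk.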

In another common service -- DNS -- you are often an intermediary, propagating information obtained from others.

Recently there has been a lot of chatter in government circles about imposing bundles of new rules on DNS operations.

What these rules may be is not yet clear.

But in the worst case you could get in trouble for publishing DNS information that came from elsewhere.

This could be a nightmare for those of you offering recursive resolvers.

And let's not forget DNS over HTTPS.

How are you going to analyze customer problems when your view of the domain name system is different from theirs?

--SHOWING: Damocles--

There is a sword of Damocles hanging over us.

The era in which we could think of the net as a collection of independent operators is ending.

The public is going to demand responsibility and accountability from operators.

If people are left feeling that they've been unfairly treated then courts or legislatures may step in.

As we have already seen in certain areas of internet governance, legislators are not techies.

They could demand the impossible.

They could demand that route outages be cured within one second.

Or that we deep inspect packets to block photos of people arrested but not yet tried.

As I speak the EU is considering new impositions on domain name operators.

If it is too difficult to pin the tail of responsibility onto a particular provider then courts may assign what is called "joint and several" liability.

That means that everybody is fully liable.

What sorts of things can we anticipate being mandated onto the operator community?

  • Increased liability.

  • Mandatory insurance.

  • Financial depth requirements.

  • Imposed procedures, including increased documentation and reporting.

  • Equipment certification by third parties.

  • Professional standards for staff -- including testing and licensing.

  • Periodic retesting or supplementary education.

  • Regulation via public utilities commissions or the FCC.

  • Ownership restrictions.

  • Recovery time limits -- third parties might be allowed to step in and take control should the time be exceeded.

--SHOWING: SMBC lobbying--

This may seem daunting.

However, other industries have faced similar issues.

Use NANOG and the IETF as places to share ideas.

But take care not to trigger laws about anti-competitive practices.

Push standards bodies to be more concerned than they have been about failure modes, brittleness, diagnostics, and recovery.

Begin to create formal and informal links to legislative and regulatory bodies.

Begin to think like lobbyists.

I've only begun to scratch the surface of the issues that are going to arise as users, governments, and courts increasingly perceive the internet as a critical, unified, public utility.

Arguments that the net is composed of independent cooperating providers are most likely going to carry little weight.

Those arguments will be crushed under the bulldozer of the public need for dependable service and accountability.

Rough times for operators may be coming.

Some of this will be undeserved.

Some will be deserved.

Don't try to evade that responsibility.

Don't play Some Other Dude Did It.

Be open and honest about internet problems and difficulties.

Openly work for solutions.

You are the supreme experts in getting IP packets from hither to yon.

Manage that expertise carefully and you will have influence.

You now have a window of opportunity to shape much of the law and regulation that is almost certain to land on you.

I urge you to take advantage of that opportunity.

That window won't be open for long.

--SHOWING: Joze Tools--

Before I conclude this talk I want to take a moment for something completely different.

I don't have a slide for this.

So imagine a drawing of some tools, say a screwdriver and a hammer.

The issue I want to raise is this:

As a utility, network operators, especially those providing services directly to consumers, are going to face time-to-repair pressures.

Over the years I've owned or raced a lot of sports cars.

I've had a slew of 1960's Italian and British cars.

And more recently I've had BMWs and presently a fairly new Miata.

The differences are qualitative.

Those old cars would always have something wrong or out of tune.

The tool kits for those old cars were primitive -- screwdrivers, pliers, wrenches, and hammers.

The tools we use for today's networks are equally primitive -- pings, traceroutes, and other things that were invented decades ago.

Automobiles have changed.

Over the years their anti-smog systems evolved into really smart control planes.

In the past we had to mess with points and a dwell meter.

Today the ignition control computer figures out the spark timing.

This automotive control plane contains a control bus with a standardized plug from which one can simply read diagnostic codes.

This has allowed modern cars to be diagnosed and repaired far more quickly and accurately.

We need the same for networks.

We need smarter control planes.

The internet is a distributed process rather than merely a set of interconnected devices.

We need a control plane that looks at a net, or the entire internet -- doing both passive and active measurements -- to look for excursions from the norm.

We might even give this new control plane some ability to change configurations or provisioning.

But we don't want to create a monster.

We must avoid the Boeing MCAS flaw.

We don't want a system that takes control and flies our networks into the ground.

How would such a control plane work?

That's a question that has been on my mind for a long time.

Perhaps we could use big data techniques.

One can imagine a company like Amazon, Netflix, or Google doing predictive dynamic provisioning based on real time feeds of web search, Twitter, and DNS activity.

This could engender tsunamis of traffic that operators will have to handle.

And we need a database of network pathologies.

We would like some yet-to-be-invented software to take a set of symptoms and work backwards towards possible causes.
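
As a very small sketch of what working backwards from symptoms could look like, the Python below ranks entries in a toy pathology table by how well their known symptoms overlap the observed ones. Nothing here comes from an existing tool: the table `PATHOLOGIES`, its three entries, the symptom strings, and the overlap score are all assumptions chosen for illustration.

```python
# A toy pathology table. Each entry lists symptoms typically associated
# with the problem and a diagnostic test worth running next.
PATHOLOGIES = {
    "MTU black hole": {
        "symptoms": {
            "large transfers stall",
            "small packets succeed",
            "icmp unreachables filtered",
        },
        "next_test": "ping with DF set at increasing packet sizes",
    },
    "duplex mismatch": {
        "symptoms": {"low throughput", "late collisions", "crc errors"},
        "next_test": "compare speed and duplex settings on both link ends",
    },
    "route flap": {
        "symptoms": {"intermittent reachability", "repeated bgp updates"},
        "next_test": "review update history for the affected prefix",
    },
}

def rank_causes(observed):
    """Rank candidate pathologies for a set of observed symptoms.

    The score is the Jaccard overlap between observed and known symptoms:
    shared symptoms divided by all symptoms in either set. Entries with no
    shared symptom are left out.

    Returns a list of (name, score, next_test), best match first.
    """
    observed = set(observed)
    ranked = []
    for name, record in PATHOLOGIES.items():
        shared = observed & record["symptoms"]
        if shared:
            score = len(shared) / len(observed | record["symptoms"])
            ranked.append((name, score, record["next_test"]))
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(ranked, key=lambda item: (-item[1], item[0]))

for name, score, test in rank_causes({"large transfers stall", "low throughput"}):
    print(f"{name}: score {score:.2f}, try: {test}")
```

A real system would need far richer data and weighting, but even a lookup this simple turns tribal knowledge into something that can be written down, reviewed, and followed as a procedure.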

Another thing we need is protocol designs that are better able to degrade gracefully rather than snap into total failure.

For a long time I've been wondering what we can learn from biological systems.

Living things are very complex systems.

Our networks are trivial by comparison.

Yet living things are very robust.

Why?

Because evolution has tended to produce layers of mechanisms, all different, that together push and prod the organism.

These mechanisms work against one another to achieve a balance.

Like an acid and a base in a buffered solution.

Or consider how the TCP protocol has competing slow start and congestion backoff algorithms.

The goal of an organism is not to reach an optimal state, but to survive.

My observation is that we need network control planes that use a multiplicity of methods to nudge our networks to survive problems.

The goal of that control plane is survival of the network service.

Network management and control has long been a black sheep in the internet community.

And we have long been focused on mere access methods such as SNMP and NETCONF.

We have not actually dealt with the hard issues.

It is my opinion that as the net becomes a critical utility, we need to revisit questions of diagnostics, repair, and management.

With that I reach the end of my talk.

I am happy to take questions here or to discuss these things off-line.

--SHOWING: Closing title--

Thanks!