Outage Stories: The Flapping Link

2022-10-27T00:00:00+00:00

Early in my career I was assigned a project to introduce a new way of counting the bytes used by cell phones for the carrier I worked for. The project itself was a complete disaster, which is a story for a different time, but it also helped me uncover and understand a number of other network problems.

The story I want to focus on today covers a period of degraded service, where for weeks, if not longer, cellular internet service in Western Canada was unstable. I also want to explore the circumstances that allowed the outage to occur.

A little background

I'm going to gloss over a lot of the more nuanced details in this post and a lot of technical specifics of cellular networks. These specifics aren't that important for the contents of this post.

What is important to know is that this story revolves around the connections between two cellular carriers in Canada. Canada is vast, and building complete nationwide coverage is expensive. So carriers using common technologies made agreements to roam on each other's networks to extend coverage. This wasn't an exact divide. High customer density areas would still be built out by each carrier, but as you moved out of coverage you'd switch to the other carrier. This is different from today, where carriers build a common radio network and subdivide who builds what part of the country.

The link that failed was between Toronto (our network) and Calgary (partner network). So traffic for a user in Western Canada would route internally in the partner network to Calgary, cross the link between companies from Calgary to Toronto, and go through the home network in Toronto to reach the internet.

From an internet routing perspective, the cell phone appeared to be in Toronto, and traffic was then routed through the cellular data network to reach the device wherever it was in Canada.

If this sounds a bit silly or high latency, it was. But at the time, the traffic from every cell phone on the network combined fit on a single 100 Mbps Ethernet connection.

The two systems don't match

The project I was working on was introducing new equipment to count the bytes used by customers for billing purposes. Normally, this was just done directly by the cellular equipment as it processed the packets, but my company wanted to introduce a bunch of new capabilities and flexibility the existing equipment didn't support.

So naturally, when introducing new traffic counting, we compared the per-customer counts against the old equipment. The problem was that they were different, and not even close to producing the same results.

So a big part of my job became figuring out why these two systems were producing wildly different usage counts.

And I found lots of problems, in both the new and old equipment. The new equipment, for example, could miss a message and get out of sync about which customer had which IP address, and then start counting traffic for one customer against the last customer who had that IP address. So we had to develop a reconciliation process that could throw out those dirty records.

On the old system, we had problems with the byte counter rolling over. The counter was 32 bits wide, so a user who used more than 4 GB within a single 24-hour record would wrap back to 0. Some users figured out that if they used enough data fast enough and disconnected, they got a small bill instead of a giant one. And some customers were in for a surprise when we fixed that bug.
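
To make the wraparound concrete, here is a minimal Python sketch of an unsigned 32-bit byte counter. The function name and the sample usage figures are illustrative assumptions, not the carrier's actual billing code.

```python
# Hypothetical illustration of a 32-bit byte counter wrapping; not the real billing code.
COUNTER_BITS = 32
COUNTER_MODULUS = 2 ** COUNTER_BITS  # 4,294,967,296 bytes, i.e. 4 GiB


def recorded_usage(actual_bytes: int) -> int:
    """Return what an unsigned 32-bit counter would report for actual_bytes."""
    return actual_bytes % COUNTER_MODULUS


if __name__ == "__main__":
    light_user = 500 * 1024 ** 2                 # 500 MiB in one record
    heavy_user = 4 * 1024 ** 3 + 50 * 1024 ** 2  # just over 4 GiB in one record
    for name, used in (("light", light_user), ("heavy", heavy_user)):
        print(f"{name}: used {used} bytes, billed for {recorded_usage(used)} bytes")
```

In this toy model the heavy user, who moved slightly more than 4 GiB, is recorded as having used only the 50 MiB past the wrap point, which is the "tiny bill" effect described above.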

One enterprise customer I heard about was doing some sort of video streaming. When they did the trial, they were using just a bit over 4 GB, so they were getting tiny bills. So they signed a contract. Then we fixed the bug a month or two later, and their bill skyrocketed. They thought we had pulled a bait and switch, even though it was far more innocuous: we had just discovered and fixed a bug.

But with all these problems and fixes, the systems still didn't match. I built a series of programs and scripts that did comparisons of records to identify which records were different and why. And one day when trying to break the data down in different ways, I separated the data by geographic region within Canada. And to my surprise, Western Canada was way worse than the rest of the country for records that didn't match between systems.

Why Western Canada?
In theory, there shouldn't really be anything special about Western Canada. Although another carrier provided the cell sites and access network, they used basically all the same equipment we did. It shouldn't have been one of those crazy 500-mile email problems, as we didn't see problems in Eastern Canada.
As with any troubleshooting, I started to develop, test, and reject theories and possible explanations, until I noticed an interesting and subtle correlation. We recorded separate counters for upload and download, and it was only the download counters that didn't match. This didn't fit any of the many problems with the new equipment I had been investigating for weeks.
While iterating on theories and possible explanations, I hit on an unexpected one. What if the equipment was actually working? What if the two systems were reporting different byte counts because they physically saw different numbers of bytes?
One of the details of the new byte counting was that it always counted in Toronto. The old system, however, counted packets in the access network, so for Western Canada the counting took place in Calgary, in the partner network.
So I started sending some pings... and to my surprise, I discovered a huge amount of packet loss between Toronto and Calgary.

Turns out I stumbled into a giant issue

Packet loss made this a much bigger problem than just my project of introducing new features. If those packets were getting lost before reaching the partner network, they were also not making it to customers.

For anyone who doesn't have a background in networking, losing 1% of packets doesn't mean the website just loads 1% slower. Due to a series of network collapses in the early days of the internet, the TCP stacks of a decade ago aggressively slowed down their throughput when they saw almost any packet loss. This has changed over time, but back then, on an already slow wireless network, this packet loss didn't just mean the internet was slow: the web page might not load at all, or the email might never be received. And I'm pretty sure I saw way higher than 1% packet loss.
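
A rough feel for why small loss rates hurt so much comes from the well-known Mathis et al. approximation for steady-state TCP throughput: rate ≈ (MSS / RTT) × (C / √p), where p is the loss probability and C is a constant near 1.22. The sketch below plugs in made-up numbers (a 1460-byte MSS and a 300 ms round trip, both assumptions chosen only to resemble a slow wireless path). It is a back-of-the-envelope model of classic loss-based congestion control, not a measurement of this network.

```python
import math

MATHIS_CONSTANT = math.sqrt(3 / 2)  # ~1.22, from the Mathis et al. approximation


def tcp_throughput_bps(mss_bytes: int, rtt_seconds: float, loss_rate: float) -> float:
    """Approximate steady-state TCP throughput in bits per second."""
    if not 0 < loss_rate <= 1:
        raise ValueError("loss_rate must be in (0, 1]")
    return (mss_bytes * 8 / rtt_seconds) * (MATHIS_CONSTANT / math.sqrt(loss_rate))


if __name__ == "__main__":
    # Assumed path: 1460-byte segments, 300 ms round trip.
    for loss in (0.0001, 0.01, 0.05):
        rate = tcp_throughput_bps(mss_bytes=1460, rtt_seconds=0.3, loss_rate=loss)
        print(f"loss {loss:.2%}: ~{rate / 1000:.0f} kbit/s")
```

Because throughput scales with 1/√p in this model, going from 0.01% loss to 1% loss cuts the achievable rate by a factor of ten, regardless of how fast the link is.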

So I started pinging hop by hop across the network until I found where the packet loss began, which isolated it to the link directly between the companies. Upon logging into the edge router between the networks, I discovered this in the logs:

... bgp up ...
... bgp down ...
... bgp up ...
.... bgp down ...
Checking the hardware, the underlying links appeared fine, with no apparent physical layer problems, errors, etc. So the link that carried packets initially appeared healthy, but the routing protocol running on top of this apparently stable link was flapping up and down.
And if the routing connection is flapping, routes that use that link will be added and withdrawn. While the session was down, traffic would be rerouted via another set of links in Montreal.
Why was BGP unstable?
This particular network link was incredibly funky for reasons I'm glossing over. BGP itself operates by establishing a TCP connection between routers, which is then used to exchange routing information. While the BGP connection is alive, each router sends keepalives to confirm its peer is still there. If those keepalives don't make it, the peer must be having a problem, so the router withdraws the routes learned from it, and alternative paths are used if possible.
So why weren't the keepalives making it?
Well, besides the flapping BGP connectivity, there was one other sign of a problem: packet drops due to network interface overload. We were trying to send more packets than the link could handle, and it was not only causing packet loss, but also causing the BGP session to flap.
So the routing table kept flapping, rerouting traffic through Montreal. Then, while traffic was rerouted, the link would be idle. So BGP would come back up, see a better path, and re-create the routes through the link. The link would overload again, and BGP would time out. Rinse and repeat.
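
The loop described above can be sketched as a tiny state machine. This is a toy model under loud assumptions: time is measured in abstract ticks, every keepalive is lost whenever offered load exceeds capacity, and the session always recovers after a fixed number of idle ticks. The function name and all parameter values are hypothetical.

```python
def simulate_flapping(offered_mbps, capacity_mbps, hold_ticks, recover_ticks, total_ticks):
    """Return one (bgp_up, link_load_mbps) tuple per tick of a toy flapping model."""
    bgp_up = True
    overloaded_ticks = 0
    down_ticks = 0
    history = []
    for _ in range(total_ticks):
        # Traffic only uses the link while its routes are installed.
        load = offered_mbps if bgp_up else 0.0
        history.append((bgp_up, load))
        if bgp_up:
            # Assumption: while overloaded, every keepalive is dropped.
            overloaded_ticks = overloaded_ticks + 1 if load > capacity_mbps else 0
            if overloaded_ticks >= hold_ticks:
                bgp_up = False  # hold timer expires, routes are withdrawn
                overloaded_ticks = 0
                down_ticks = 0
        else:
            down_ticks += 1
            if down_ticks >= recover_ticks:
                bgp_up = True  # idle link lets the session re-establish


    return history


if __name__ == "__main__":
    for up, load in simulate_flapping(120, 100, hold_ticks=3, recover_ticks=2, total_ticks=10):
        print("bgp up  " if up else "bgp down", f"load={load:.0f} Mbps")
```

With offered load above capacity the session never settles, and with offered load below capacity it never drops, which is why a bigger link was enough to end the cycle.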
Mistakes were made
As much as we all hate our local telecoms, at least in North America, there is an incredible culture around stability. While the current big tech companies were developing SRE and production cultures, telecoms were running incredibly reliable and redundant systems.
So a question some of you might be asking by now: isn't the job of a network operator basically to deliver voice, text messages, and internet? And these are billion dollar companies, so don't they have teams dedicated to making sure enough capacity is available? Shouldn't someone notice and be planning months ahead of time for the capacity needs of the system?
Well, just like any complicated failure, there were multiple small issues that combined in exactly the right way to allow this problem to impact customers.
One of the key mistakes was how the capacity team monitored capacity and timed upgrades and increases in link sizes. Their charts were based on interface counters that were aggregated into 5 minute intervals.
So the capacity engineers were looking at a chart of those 5 minute averages, which showed a link that was busy but not saturated.
As far as anyone looking at that report would be concerned, the network links were a bit hotter than they'd like, so they were planning for link upgrades, but there wasn't a rush. They could take their time ordering new links and getting them set up. The new links were only a couple of weeks away when I discovered this problem.
But what the capacity team didn't know was that the underlying link was flapping. If they had had metrics recorded, say, every second, they would have seen the link swinging between completely full and nearly idle.
The capacity team just couldn't see it.
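
The way averaging hides this can be shown with synthetic data. The sketch below invents a per-second load pattern (saturated for 40 seconds, idle for 20, on an assumed 100 Mbps link) and collapses it into 5 minute averages. None of these numbers come from the real network; they only illustrate the effect.

```python
from statistics import mean

LINK_CAPACITY_MBPS = 100.0  # assumed link size, for illustration only
WINDOW_SECONDS = 300        # 5 minute aggregation interval


def flapping_samples(seconds, busy_seconds=40, idle_seconds=20):
    """Synthetic per-second load: saturated while routes are up, idle while withdrawn."""
    cycle = busy_seconds + idle_seconds
    return [
        LINK_CAPACITY_MBPS if (t % cycle) < busy_seconds else 0.0
        for t in range(seconds)
    ]


def five_minute_averages(per_second_mbps):
    """Collapse per-second samples into one average per 5 minute window."""
    return [
        mean(per_second_mbps[i:i + WINDOW_SECONDS])
        for i in range(0, len(per_second_mbps), WINDOW_SECONDS)
    ]


if __name__ == "__main__":
    samples = flapping_samples(600)
    print("peak per-second load:", max(samples), "Mbps")
    for avg in five_minute_averages(samples):
        print(f"5 minute average: {avg:.1f} Mbps")
```

In this made-up pattern the link is pinned at 100% two thirds of the time, yet every 5 minute data point reads as roughly 67% utilization: warm, but nothing that looks urgent.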
What about other teams, such as the IP operations and routing team? Surely they would notice. It turns out they did notice, but somehow they had developed an understanding that they weren't to worry about this link.
In software development we use code reviews and pull requests for making changes, so that more than one engineer checks over the work. In telecom at the time, this check and balance came from having different teams. Operations was separate from engineering as a discipline, so engineering would submit work orders to operations for changes. And the mobile wireless teams, in a similar structure, would drive changes through the IP routing teams.
This left the IP operations team several layers removed from what any of their links were actually for. So out of the 10,000 links they were responsible for all over the country, there was some link on an edge router in Toronto connected to another cellular carrier in Calgary. Missing the context on what this link did, the team just ignored the problem.
There were at least two other teams that were aware of problems, but didn't have what they needed to figure out the cause.
There was a VoIP product that ran on the data network instead of the circuit network, which had complaints of garbled voice. But working off customer complaints, and being a service team removed from the underlying mobile network, they didn't have what they needed to identify packet loss as the culprit.
And the customer service escalation teams did notice, and had lots of complaints escalated from customers. But this was a cellular network with radio connections, and there are lots of reasons a web page might not load on a phone on those early data connections. And queries to the network teams got responses that the network was healthy. So their investigation steered towards explanations that would not be caused by the underlying IP network.
Culture
What I find fascinating about this particular incident is that this wasn't a hard problem to solve. It's barely a technical issue: the link was too small, and it got replaced with a bigger link.
Why I find this fascinating, especially with the perspective of years later, is that this was one of my first experiences with how organizations fail. These were all relatively small mistakes any of us could make. But teams with different missions were losing context on how their piece of the network, and their work, directly impacted customers.
But there was also a larger cultural problem. For a culture that's supposed to be about availability and stability, far beyond what almost any SRE team strives for today, there was no outage report. I think I tried to open one, and before it was filled out, a manager went in and closed it, indicating no customer impact. So there was no organizational effort to understand what happened, what improvements could be made, or how to prevent a recurrence. Any sense of an investigation was discarded once the link capacity was increased.
I believe the problem was misaligned incentives. One of my favorite articles on the topic is Sins of Commission by Joel Spolsky, about how incentive plans backfire. The TL;DR is that directly connecting employee compensation to some sort of company metric is almost guaranteed to backfire. In our case, if the outage impact had been calculated, it might've blown the availability numbers for the year.
And the problem is, everyone's bonus was tied to the availability numbers. And the bonus was a big component of take home pay at that company.
So there was a massive incentive to not understand the issue, and to bury the problem. It doesn't even have to be nefarious; the context of the impacts of this problem was all spread out. So customer service teams knew customers were complaining, but the manager who closed the outage investigation only knew a network link was bouncing. That manager's understanding was that the service was somewhat working, and there was no context that this impacted customers at all. The incentive structure nudged the team towards not being curious, and not wanting to find out or learn how degraded service impacted customers.
As an aside, customers were openly complaining on internet forums about how their cell phones didn't work. So customers definitely noticed this problem.
The only reason I have the context to tell the story years later is that I had relationships with each team that was impacted, and kept a silent eye on the forums for what customers were discussing amongst themselves.
What can we learn from this
I try to always remind myself that we don't know what we don't know. I could've easily and readily made any of the mistakes that led to this outage. To hit on a few key points:

- Troubleshooting tools don't necessarily tell you what you think they do. It's important to know how the tools actually work. For example, metrics are a proxy for what's happening, but they are also aggregated or stripped of precision to make the data accessible and useful.
- Be conscious of the incentive structure and culture. Misaligned incentives can be dangerous, and lead to cultures that act contrary to what leadership is trying to achieve. Where appropriate, push back on cultures and incentives that cause these sorts of problems.
- Not knowing how your work impacts your customers, or the customers of your customers, or why someone pays you for your service, leaves blind spots. Combine that with a lack of curiosity, and problems can be allowed to just sit there and fester. So alarms can be ringing and there's no reaction... it's just noise for someone who doesn't know what it's for. This is doubly true when the team hearing the alarm isn't the one to fix it, such as capacity increases done by engineering and capacity teams.
- Be curious. I think intellectual curiosity is more important than many people realize, especially when responsible for production. There still needs to be a balance between going down rabbit holes and getting things done. But that problem you see might be giving a customer a miserable day. And they may not have a mechanism to yell loud enough to draw your attention to the problem right under your nose.
Disclaimer: This story is from more than a decade ago, and doesn't accurately reflect the present state of any company or network. Any opinions expressed are my own.



Outage Stories: The copy and paste outage
2022-08-13T00:00:00+00:00
When working in telecommunications, one of the things you learn is to be very careful in production. The companies are chasing something like 99.999% availability or higher. So mistakes in production are to be avoided, but like anything else, team members get overconfident and make simple mistakes.
One of these outages was caused by an accidental paste into a terminal, which took out LTE access for a good chunk of Canada. The good news in this case is that, at the time, the phones probably fell back to the slower and higher latency UMTS network.
Context
In these high availability networks, redundancy tends to be a key point. And there is redundancy at multiple layers. So if you have a signalling proxy, you deploy one in Toronto, and one in Montreal. If the Toronto data center has a catastrophic failure, the Montreal data center takes over.
But also within a data center, you don't deploy one router, you deploy two, and connect servers to both. The servers have two power supplies, using DC power fed off a battery room. So if a power supply burns out, the server is still running. If an SFP module burns out, you've got another network connection. If a router crashes or is rebooted for an upgrade, you fail over to another network path.
Some equipment will be deployed as multiple blades or a cluster within a data center. One blade fails, and another takes over.
So for almost any hardware or physical failure you can think of, there is some other path ready to take over. Whether that's networking, power, fire, flood, meteors, or more. At least, that's the theory… it gets a lot harder when it comes to software and state exchange.
The Outage
So we've established that the services should be highly redundant, to achieve high availability. So an admin logged into a single host shouldn't be able to do much damage. Even an evil rm -rf / should only take out a single blade, and there would be more blades or even more regions to take over.
So what happens if you're logged in as root, and accidentally paste a big block of text that just happens to contain something along the lines of:
...
to change the hostname you run
hostname derp
and the system will update the hostname
...
Well, the shell will try to interpret those lines as commands. It will probably throw tons of errors, but the hostname derp line is a valid command. And when you're root, you're allowed to do things like change the hostname.
Check out the output below. You might not even notice the hostname changed, because the hostname in your PS1 prompt is cached, especially if the change is buried in dozens of lines of errors from the paste.
root@test:/home/knisbet# to change the hostname you run
hostname derp
and the system will update the hostname
bash: to: command not found
bash: and: command not found
root@test:/home/knisbet#
So the hostname got changed, big deal.
Well, this particular system was using Linux-HA to do clustering, and the software operated as a cold standby. What that means is that a particular process would only be running on a single host within the cluster. If the cluster control plane saw a failure, it would start a new instance. This is just like Kubernetes starting a new pod when a pod fails, but years before Kubernetes.
And the way this system detected a process failure was to search the process list for a matching process. If we run the command super-ha-proxy... and it's in the process list, the process is working. If it's not in the process list, it must have crashed, exited, etc., and we need to launch a new instance.
And if you pass the hostname as a parameter to your software instance, your running process would look something like super-ha-proxy --instances=1 --hostname=server01... in the process list.
But the script that checks the process list uses the new hostname when searching. The process is still there and operating normally, but our supervisor can't find it anymore, decides the process has crashed, and starts launching another one.
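
The failure mode can be sketched in a few lines of Python. This is not the real Linux-HA resource script, which I no longer have; the function names, the substring matching, and the in-memory process list are all assumptions made for illustration, reusing the made-up super-ha-proxy command line from above.

```python
def is_running(process_list, binary, hostname):
    """Naive health check: find a command line with the binary and the current hostname."""
    needle = f"--hostname={hostname}"
    return any(binary in cmdline and needle in cmdline for cmdline in process_list)


def supervise(process_list, binary, hostname):
    """Launch (append) a new instance if the check finds none. Returns True if launched."""
    if is_running(process_list, binary, hostname):
        return False
    process_list.append(f"{binary} --instances=1 --hostname={hostname}")
    return True


if __name__ == "__main__":
    processes = ["super-ha-proxy --instances=1 --hostname=server01"]
    print("healthy host, launched:", supervise(processes, "super-ha-proxy", "server01"))
    # The accidental paste changes the hostname the check searches for.
    print("after paste, launched:", supervise(processes, "super-ha-proxy", "derp"))
    # Restoring the hostname makes the check pass again, but the extra instance remains.
    print("hostname fixed, launched:", supervise(processes, "super-ha-proxy", "server01"))
    print("instances now running:", len(processes))
```

The check is keyed on something that can change underneath it (the hostname) rather than on the identity of the process it started, so a change to the host is indistinguishable from a crash.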
Except this software runs in cold standby, so it conflicts with itself when multiple instances are running. Not in a way that causes it to clearly crash; instead it gets very confused. The original process still has connections to all peers, and is trying its best to operate normally. But the new process is also establishing connections to peers with the same identifier, and messing with internal state about where to route messages. So now messages are ending up in unexpected processes that don't know what to do with them.
If the equipment had just failed, all the peers would've seen that, and started rerouting through Montreal. But the software was still running, in a corrupted state, telling the world that Toronto was still alive and well. So when peers used those proxies, messages would fail, timing out or getting errors.
So the team logs in, investigates, and notices pretty quickly that the hostname is wrong. But fixing the hostname doesn't solve the problem. The supervisor script just goes back to detecting the old process, and has no idea that an additional instance has been launched. So the system is just as happy to keep moving forward in its corrupted state.
But some reboots later and Toronto is back online.
And a million cell phones lose their LTE internet access...


Taking a look at the Rogers Outage CRTC Letter
2022-07-27T00:00:00+00:00
Last week, the CRTC made public a redacted version of the response Rogers filed with answers to the regulator's questions. I started my career in wireless telecommunications for a competitor to Rogers here in Canada, and wanted to dig through the outage. I've been out of the industry for a few years now, but my outdated perspective may be of some use here.
Unfortunately, the redacted version removes almost all the information that would be useful for other carriers or other industries to learn from this outage. But I still dug through the report to try and find what useful information I could.
Resources
Regulator Filings

- CRTC Letter to Rogers: https://crtc.gc.ca/eng/archive/2022/lt220712.htm
- Rogers response to CRTC (docx): https://crtc.gc.ca/public/otf/2022/c12_202203868/4215445.docx
- Google Docs version of the response: https://docs.google.com/document/d/1e8fmZGzy_VaYuLtXgDz62LTh4DfvkPkj/edit?usp=sharing&ouid=116779044200847408744&rtpof=true&sd=true
Media Reports
If you're instead interested in a media level analysis, here are some of the media reports I've found.

- How a coding error caused Rogers outage that left millions without service - Globe and Mail
- Rogers shares explanation after 'unprecedented' outage - National Post
- Rogers unable to switch customers to Bell, Telus, despite competing carrier offers - Toronto Star
CRTC Letter
The letter itself basically outlines that the Canadian Radio-television and Telecommunications Commission (CRTC) wants to inquire into the outage that began on July 8th, 2022. It also gives the basis for the inquiry, including disruptions to business, emergency services, and more.
You can read the full letter at the link above, but effectively it's an outline of questions that Rogers has been asked to respond to. I'll go through what I see as the important questions, and add my perspective, as we go through the response.
Rogers Response
At a high level, Rogers starts by confirming that they had identified the cause of the outage.
We have identified the cause of the outage to a network system failure following an update in our core IP network during the early morning of Friday July 8th. This caused our IP routing network to malfunction. To mitigate this, we re-established management connectivity with the routing network, disconnected the routers that were the source of the outage, resolved the errors caused by the update and redirected traffic, which allowed our network and services to progressively come back online later that day. While the network issue that caused the full-service outage had largely been resolved by the end of Friday, some minor instability issues persisted over the weekend.
What I find interesting about this is the reference to minor instability issues that persisted over the weekend.
When I worked on a similar outage, where the underlying core network went down for 20 or 30 minutes, my team spent the next 10 or so hours finding and fixing the glitches in the cellular network. These were mainly broken states within different equipment. Nothing exposes those bugs as well as taking out the underlying network at all layers at the same time.
For example, think of an HTTP proxy. A client connects to the proxy and makes a request, the proxy holds some request state, and sends the request to a server. If the upstream server goes down, you probably know exactly what happens, and it's well tested: just return an error to the client. If the client loses its connection, that's well understood as well: complete the upstream request, and discard the response. But what happens if both the upstream and client networks fail at the same time? You start exercising a code path that combines failures, which usually isn't as well tested. Combine this with some missing internal state, and you start having failures.
Especially in cellular networks, there is all sorts of state that's distributed around the network and needs to be kept in sync for different call procedures. This can result in things like being able to send but not receive an SMS message.
Our engineers and technical experts have been and are continuing to work alongside our global equipment vendors to fully explore the root cause and its effects. 
This is another interesting point that might be lost on many who don't know the industry. Telecommunication companies primarily operate as integrators of vendors' equipment. While this isn't always the case, most of the equipment used to build the network will be purchased from vendors. So if you want a cellular radio for LTE, you may buy that particular piece of equipment from Nokia, Ericsson, Huawei, or others.
This model of buying equipment from vendors often creates an incentive to blame the vendor for outages. But even harder, at times, can be convincing another company to work on a feature or design change that makes configuration errors less likely. And like any complex situation, there is probably plenty of blame to share between both the operator and the vendor.
Additionally, Rogers will work with governmental agencies and our industry peers to further strengthen the resiliency of our network and improve communication and co-operation during events like this. Most importantly, we will explore additional measures to maintain or transfer to other networks 9-1-1 and other essential services during events like these.
I'll dig into this deeper later on. But the failure of 911 services, especially for cell phones, seems like an incredibly bad design oversight. Cell phones that are not attached to a network can already make an emergency call on any network the phone can find. So continuing to advertise the Rogers network throughout the outage caused phones to stick to the broken network. Or it probably caused an unattached phone to scan for an available network and not find a working 911 network.
Questions and Answers
About the outage
Provide a complete and detailed report on the service outage that began on 8 July 2022
Unfortunately it looks like Rogers preferred to keep the full details confidential, and the full timeline was attached as confidential. But some high level details are included.
The network outage experienced by Rogers on July 8th was the result of a network update that was implemented in the early morning. The business requirements and design for this network change started many months ago. Rogers went through a comprehensive planning process including scoping, budget approval, project approval, kickoff, design document, method of procedure, risk assessment, and testing, finally culminating in the engineering and implementation phases. Updates to Rogers’ core network are made very carefully.
For those outside of telecom, the term method of procedure may be unfamiliar. While different companies may operate differently, the method of procedure (MOP) is basically a document that outlines how a change in the network should be executed. It can be as detailed as every command someone in operations should run to make the change, step by step. It may also contain pre-checks that should be executed.
The philosophy is a sort of separation of duties, where one engineer writes the procedure for the change, and then someone tasked with operations is responsible for executing it. I don't know if this is how Rogers is using MOPs, however.
Maintenance and update windows always take place in the very early morning hours when network traffic is at its quietest.
I can't comment on Rogers; however, I know that at other big telecoms in Canada this isn't true for every change. There are plenty of changes that are deemed non-risky, and will be executed when convenient. I've also seen different managers and agendas come into play that push against change windows in order to be more efficient, and often a back and forth on the balance of stability vs getting things done.
The configuration change deleted a routing filter and allowed for all possible routes to the Internet to pass through the routers. As a result, the routers immediately began propagating abnormally high volumes of routes throughout the core network. Certain network routing equipment became flooded, exceeded their capacity levels and were then unable to route traffic, causing the common core network to stop processing traffic. 
This Twitter thread likely covers this side of the impact better than I can: https://twitter.com/atoonk/status/1550896347691134977
The Rogers outage on July 8, 2022, was unprecedented. As discussed in the previous response, it resulted during a routing configuration change to three Distribution Routers in our common core network. Unfortunately, the configuration change deleted a routing filter and allowed for all possible routes to the Internet to be distributed; the routers then propagated abnormally high volumes of routes throughout the core network. Certain network routing equipment became flooded, exceeded their memory and processing capacity and were then unable to route and process traffic, causing the common core network to shut down. As a result, the Rogers network lost connectivity internally and to the Internet for all incoming and outgoing traffic for both the wireless and wireline networks for our consumer and business customers.
I'm going to nitpick calling this an unprecedented outage. Based on what I can get out of the report, and not being able to see the full internal analysis, this seems fairly predictable. The outage explanation is basically that some automation removed a piece of configuration that was essential for the routers to function. Without the configuration, routers within the network went into CPU overload processing routing updates.
Someone knew about this and placed that configuration on the router.
This is a known failure mode for most routing equipment. Most of these big routers seem like big powerful machines, but they tend to be more of a fast path for pushing many millions of packets, with separate compute for sending and receiving signalling about where those packets should go. There are often many ways to overload the control plane side that does the signalling, which will cause peers to think the router has died.
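
As a rough illustration of why a single deleted filter matters, the sketch below models a route filter as an allow-list of networks. The actual Rogers filter, prefixes, and router limits are redacted or unknown to me, so every prefix, the capacity number, and the function names here are hypothetical.

```python
import ipaddress

# All numbers and prefixes below are made up for illustration.
ROUTE_TABLE_CAPACITY = 100  # hypothetical limit on routes an internal router can hold


def apply_route_filter(routes, allowed_networks):
    """Return the routes permitted by the filter; None models a deleted filter."""
    if allowed_networks is None:
        return list(routes)
    return [
        route for route in routes
        if any(route.subnet_of(allowed) for allowed in allowed_networks)
    ]


def build_candidate_routes():
    """Two internal routes plus 512 synthetic 'internet' routes."""
    internal = [ipaddress.ip_network("10.1.0.0/16"), ipaddress.ip_network("10.2.0.0/16")]
    external = list(ipaddress.ip_network("198.18.0.0/15").subnets(new_prefix=24))
    return internal + external


if __name__ == "__main__":
    routes = build_candidate_routes()
    internal_only = [ipaddress.ip_network("10.0.0.0/8")]
    for label, policy in (("filter in place", internal_only), ("filter deleted", None)):
        accepted = apply_route_filter(routes, policy)
        status = "OK" if len(accepted) <= ROUTE_TABLE_CAPACITY else "OVERLOADED"
        print(f"{label}: {len(accepted)} routes -> {status}")
```

The point of the toy is only that the filter is what keeps the number of routes within what downstream routers were sized for; remove it, and every router behind it inherits the full table at once.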
How did Rogers prioritize reinstating services and what repairs were required?</a></h3>
The prioritization of service restoration was always dependent on which service was most relied upon by Canadians for emergency services. As wireless devices have become the dominant form of communicating for a vast majority of Canadians, the wireless network was the first focus of our recovery efforts. Subsequently, we focused on landline service, which remains another important method to access emergency care. We then the worked to restore data services, particularly for critical care services and infrastructure.
</span></code></pre>
I suspect reality is a bit more nuanced, in that multiple teams were probably looking at the equipment they were responsible for in parallel. If there was a glitch for some subscribers in the LTE network for example, people who focus on the LTE network were likely addressing that. But from a total outage perspective, I'm sure teams were focussing more or less in this order.</p>
Having worked big outages like this, I know that once the underlying IP network is restored and the upper layer network comes back up, there will be a plethora of impacts, alarms, and metrics that need to be sorted through. </p>
When I worked a similar outage for a competitor, the team would find something like call failure rates are 5x normal, but also trending back towards normal. And then it's a tough thing to figure out, will it return to normal on its own? Maybe customers are rebooting their phones, so it looks like it's returning to normal but only because customers are taking their own actions. What options do we have to clean up the call states that are leading to those failures? Maybe the only tool available doesn't know who is and isn't working, so the only option is to reset every device. Is resetting every device worth it to move from a 2% failure rate to a normal 0.4% failure rate? </p>
What measures or steps were put in place in the aftermath of the earlier-mentioned April 2021 outage, and why they failed in preventing this new outage?</a></h3>
Everything substantive in the response to this question was redacted. It's unfortunate, because the industry as a whole, including non-telecom fields like SRE, can learn substantially from how telecoms operate networks as reliable as they are. </p>
Basically Rogers did lots of vague things to prevent similar failures.</p>
How did the outage impact Rogers’ own staff and their ability to determine the cause of the outage and restore services?</a></h3>
At the early stage of the outage, many Rogers’ network employees were impacted and could not connect to our IT and network systems.  This impeded initial triage and restoration efforts as teams needed to travel to centralized locations where management network access was established. To complicate matters further, the loss of access to our VPN system to our core network nodes affected our timely ability to begin identifying the trouble and, hence, delayed the restoral efforts.  
</span></code></pre>
This is the inherent difficulty in managing network systems. While in my experience the network architectures tend to be layered, when you're also the ISP I can imagine the difficulty in this separation. And problems are rare enough that it gets difficult to predict exactly what outages will knock out employee access.</p>
A lot of the carriers also lease infrastructure from each other, so an outage at one big telco can cause impacts in another.</p>
I almost wonder if there is room for something like Starlink here, as the satellite infrastructure would almost be guaranteed to provide diverse network access to a location, without running additional wires in the ground. </p>
Having experienced a similar failure, when I noticed my home internet and cell phone were all offline, I started driving to the office to get a stable connection to production. That was a long day.</p>
Extent to which Rogers sought or received assistance from other TSPs in addressing the outage or situation arising from the service interruption?</a></h3>
In order to allow our customers to use Bell or TELUS’ networks, we would have needed access to our own Home Location Register (“HLR”), Home Subscriber Server (“HSS”) and Centralized User Database (“CUDB”). This was not possible during the incident. 
</span></code></pre>
Basically, what Rogers is saying here is that the databases used to track and authenticate mobile devices were also unavailable. In cellular mobility, the network can be thought of as two networks for two different purposes. There is a visited network, more or less the radio towers you connect to. And there is the home network, which is what actually gives you internet, cellular, SMS, and other services. When you're on your own carrier, you're using your own carrier's visited and home networks. </p>
But when you go somewhere else, say the United States, you change your visited network and keep your home network. What Rogers is basically saying is that their home network was broken, so the other carriers weren't able to help, because they wouldn't be able to authenticate devices or connect them back to this home network.</p>
There is some tech that allows for internet connections to only use the visited network, but when I was in the industry this was not commonly deployed.</p>
Furthermore, given the national nature of this event, no competitor’s network would have been able to handle the extra and sudden volume of wireless customers (over 10.2M) and the related voice/data traffic surge. If not done carefully, such an attempt could have impeded the operations of the other carriers’ networks. 
</span></code></pre>
This is a key point. I don't know that any of the other major carriers would be able to predict how the sudden influx would affect their networks. While telecom is trying to move to a model a lot more like SaaS providers that can autoscale equipment, I don't know how successful this would be.</p>
Let's pick something simple. The number of IP addresses available to be assigned to a phone. The other operators likely only provision enough to cover their own needs. Taking over another carrier in Canada isn't an expected failover pattern, so there isn't provisioning to handle this need. And IP addresses are likely only 1 of 10,000 different capacity constraints that go into a mobile network.</p>
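As a rough back-of-the-envelope sketch of that IP address constraint in Python: the 10.2M customer figure comes from the Rogers response quoted above, but the host carrier's pool (`10.0.0.0/8`) and the number of addresses already in use are invented for illustration. Not every customer needs an address at the same moment either, so treat this as a worst case.

```python
import ipaddress


def pool_headroom(pool_cidr, in_use):
    """Return how many addresses remain in a pool after current usage."""
    return ipaddress.ip_network(pool_cidr).num_addresses - in_use


# Hypothetical host carrier: a /8 pool with 9 million addresses already in use.
headroom = pool_headroom("10.0.0.0/8", 9_000_000)

# Rogers wireless customers, as stated in the report.
incoming = 10_200_000

# Devices that could not be given an address if everyone showed up at once.
shortfall = max(0, incoming - headroom)

print(headroom, shortfall)
```

Even with a pool of almost 16.8 million addresses, the made-up host carrier in this sketch would come up a couple of million addresses short, and that is only one constraint among many.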
My 2 cents is that the most likely path to even considering something like this would be more akin to multi-SIM support, where you would have credentials on your SIM card for both Rogers and a competitor network and the device could switch if warranted. Then the possibility of this failover goes into the capacity planning for the networks, and it is only offered to customers that need it.</p>
Impact on Emergency Services</a></h2>
The impact to emergency services is really why I wanted to dig into this report, and try to understand specifically why Rogers cell phones were unable to make emergency calls. Having done some small work on 911 services for cell phones, normally any device is able to attach to any network, unauthenticated, to make emergency calls. Which means while the Rogers core network was down, something was causing the Rogers radio network to continue to advertise that it was able to take emergency calls. Otherwise the cell phones should have scanned for any available wireless network, and done an emergency attach to get emergency services.</p>
You don't need a SIM card in your phone, a valid account, etc to make a 911 call. </p>
Provide a complete and detailed report on the impact on emergency services of the outage that began on 8 July 2022, including but not limited to:</a></h3>
With respect to wireless public alerting service (WPAS”), the Rogers Broadcast Message Center (“BMC”) platform was operable to receive alerts from Pelmorex, the WPAS administrator.  However, broadcast-immediate (“BI”) public alerts could not be delivered to any wireless devices across Rogers’ coverage areas due to the outage. Based on a review of the alerts received into the WPAS BMC platform, the only impact occurred in the Province of Saskatchewan. There were four alerts, and associated updates, received but not delivered to wireless devices in Rogers’ coverage area. There were no other alerts issued, as seen on our WPAS BMC platform.
</span></code></pre>
I interpret this and other areas of the report to basically indicate that the IP core outage caused the cell sites to be unreachable. So to send an alert out to all devices in an area, you need to reach the cell site to send the message to phones.</p>
With respect to broadcasting (cable TV/Radio) alerts, our alert hardware is connected to our IP network. Since we had no connection to the Internet on July 8th, we were unable to send out any alerts on that day in the regions that we were serving.
</span></code></pre>
And alerts on services other than cell sites also depend on the IP network.</p>
Whether the outage specifically impacted the 9-1-1 networks or only the originating networks, and if the former, how was this possible in light of resiliency and redundancy obligations imposed by the Commission</a></h3>
The outage solely impacted Rogers’ originating network. The 9-1-1 networks that receive calls from originating networks are not operated by Rogers. Rather, they are operated by the three large Canadian Incumbent Local Exchange Carriers (“ILECs”). They were unaffected by the outage.
</span></code></pre>
This is basically saying that Rogers is not operating the 911 networks themselves, as those are operated by Telus, Bell, and SaskTel across Canada. So the impact was isolated to the Rogers network's ability to deliver emergency calls to the 911 networks.</p>
Number of public alerts sent that did not reach Rogers’ customers, broken down by province;</a></h3>
Only four (4) alerts were received on Rogers WPAS BMC platform on July 8th. All alerts were in the Province of Saskatchewan. No other alert was issued in Canada on that day:

1. WPAS ID 957:  7:40AM CST: Saskatchewan RCMP – Civil Emergency (Dangerous Person)
2. WPAS ID 960:  4:05PM CST: Environment Canada – Tornado (Warning)
3. WPAS ID 964:  4:19PM CST: Environment Canada – Tornado (Warning)
4. WPAS ID 982:  5:31PM CST: Environment Canada – Tornado (Warning)
</span></code></pre>
I'm just including this as it adds perspective on the number of alerts Rogers wasn't able to deliver across the country that day.</p>
How were 9-1-1 calls processed during the outage and whether they were able to be processed by other wireless networks within the same coverage area</a></h3>
As seen in Rogers(CRTC)11July2022-2.i above, Rogers was able to route thousands of 9-1-1 calls on July 8th.  Rogers’ wireless network worked intermittently during that day as we were trying to restore our IP core network, varying region by region.

...

The connection state of the UE to Rogers wireless network, and the stability of our network, determined the ability of Rogers wireless customers to have their 9-1-1 calls processed by other wireless networks within the same coverage area.

Bell and TELUS confirmed to us that some of our customers were able to connect to their wireless networks in order to place 9-1-1 calls.
</span></code></pre>
The UE is the User Equipment, basically the cell phone.</p>
I think what Rogers is trying to say, is if the cell phone was in a state where it didn't see the Rogers network, like "No Service" on the display, an emergency call would do a network scan and attach to another network with service. </p>
Whether other measures could have been taken to re-establish 9-1-1 services sooner</a></h3>
No other measures would have helped restore 9-1-1 service on July 8th. One possible option that was explored by Rogers was to shut down our RAN. Normally, if a customer’s device cannot connect to their own carrier’s RAN, they will automatically connect to the strongest signal available, even from another carrier, for the purpose of making a 9-1-1 call. However, since Rogers’ RAN remained in service on July 8th, many Rogers customers phones did not attempt to connect to another network.
</span></code></pre>
If this is the case, what the report doesn't talk about at all is that non-Rogers customers could be impacted by this outage as well. If a competitor had very weak coverage, that competitor's customer's device might scan, find the Rogers network, and try to use it for the emergency call instead of finding a working network. Or a device that is roaming and in airplane mode, and then needs to make an emergency call, will try to attach to the first network it finds.</p>
So if this was the case, the Rogers network behaving like this could have prevented emergency calls from reaching a working network. This certainly would be the minority of devices, but I find this to be infuriating that the network sat there advertising 911 services, for hours, that it could not deliver.</p>
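Here is a small Python sketch of the behaviour I suspect. To be clear, this is my simplified guess at the device logic, not what the 3GPP standards or any real handset actually implement, and the `Network` fields and `place_emergency_call` function are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Network:
    name: str
    advertises_emergency: bool  # what the radio network broadcasts
    core_working: bool          # whether a call can actually be completed


def place_emergency_call(camped_on, visible):
    """Simplified device logic: stay on the current network if it claims
    emergency support, otherwise attach to the first visible one that does."""
    if camped_on is not None and camped_on.advertises_emergency:
        chosen = camped_on
    else:
        chosen = next((n for n in visible if n.advertises_emergency), None)
    if chosen is None:
        return None, False
    return chosen.name, chosen.core_working


rogers = Network("Rogers", advertises_emergency=True, core_working=False)
other = Network("Other carrier", advertises_emergency=True, core_working=True)

# RAN still on air during the outage: the phone never leaves Rogers.
stuck = place_emergency_call(rogers, [rogers, other])

# RAN shut down (or "No Service"): the phone scans and finds a working network.
rescued = place_emergency_call(None, [other])

print(stuck, rescued)
```

In this model the only thing that changes between a failed call and a successful one is whether the broken network keeps advertising itself.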
What alternatives are available to Rogers’ customers to access 9-1-1 services during such outages</a></h3>
The GSM standard for the routing of 9-1-1 calls implies that a wireless customer always has the option to remove the SIM card from their device and then to place the 9-1-1 call. The handset will register to another wireless network (the one with the strongest signal, even if there are not roaming arrangements).
</span></code></pre>
I think the problem is most people wouldn't even know what their SIM card is, let alone that it can be removed to make a 911 call. But as with above, the problem here appears to be that Rogers continued to advertise a 911 network, so the device might still connect.</p>
I'm also not sure that the strongest signal is correct. With so many frequencies now in use for cellular networks, it can take quite some time to do a full network scan. So I'm not sure devices in this state will scan all possible networks and then choose the strongest signal. I think but am not sure that the devices would connect to the first network found.</p>
Further, some newer smart devices have the capability to reconnect automatically to other wireless network for 9-1-1 calls when the home network is down. 
</span></code></pre>
If this is the case that's great. It would probably be a great idea to put this into the standard.</p>
Summary</a></h1>
I really wanted to dig into this report, as I've worked these sorts of outages. And they are really difficult, as there are infinite ways to make the problems worse, lots of contradictory information, theories of the root cause, and missing information. It is really difficult to operate these networks.</p>
But what I found most frustrating, was the 911 services. The core network is down, but the radio network is continuing to indicate 911 services. While I only suspect this is the case, there isn't a good reason that other networks couldn't be used, other than a desire to do so. I really hope this doesn't get lost on all the major carriers after this outage.</p>


ShellGuard: Week 2 - Road to MVP
2022-07-11T00:00:00+00:00
There isn't a great deal of progress to talk about this week. Most of the progress has been in intangible areas. The product demo I've been working on is making steady progress, but it offers no security benefits. So it's more of a UX demo, and will need to be rebuilt to offer an actual MVP. </p>
So now that I'm able to demonstrate most UX patterns I'm thinking of for ShellGuard, I'm starting to consider when to pivot into an MVP: the smallest product that provides real security benefits, so that I can start getting users. I think I've figured out enough for an MVP, so it's time to consider how much further I want to take the demo.</p>
On the non-tangible side, I've binge read The Mom Test</a> to better ask the right product questions, and been doing some research into other approaches. Basically, I want to figure out if there is a market here or not. If shell and user security is an area companies will care about enough to invest into solutions, if new solutions are presented.</p>
I've also started thinking about how to design the best UX patterns. This is harder than it might seem, as even in the prototype there is already conflicting language that would be confusing for users. </p>
Anyways, not a huge amount to report, but making some steady progress.</p>


ShellGuard: Week 1
2022-07-02T00:00:00+00:00
This week, I decided to start a new project called ShellGuard. For the last year or so, I've had an itch. It all started with a customer at Teleport</a> who had concerns about system administrators' ability to access customer data on the systems they administer.</p>
My knee-jerk thought is this is an impossible problem. The security controls don't create a clear way to allow a user to change the system, but not touch data. Even if you can write some clever SELinux policy or BPF rules, it's generally trivial to bypass that protection.</p>
Even if we narrow our view to just audit, common syscall / BPF audit mechanisms are easy to bypass. They're more for theatre than actually catching a sophisticated breach of security.</p>
But it bugged me... badly... that we can't restrict access. </p>
And that's where the idea of ShellGuard came in. We already have technologies for running untrusted workloads on a Linux host. Things like gVisor (https://gvisor.dev/) and Firecracker (https://firecracker-microvm.github.io/) provide lightweight ways to virtualize or isolate workloads. What strikes me about a solution like gVisor is basically the ability to redirect syscall handling to another application. Why can't we do the same thing with a shell session? If we wrap the user session within an application kernel like gVisor, and insert ourselves between the user and the kernel, we can implement a new policy engine.</p>
This idea fascinated me for a few reasons. </p>
A user can really be set up with a read-only view of the host. Policy could be designed around the fact that a user really has multiple use cases. Say I need internet access to GitHub and database access. A user running untrusted software could be exploited to upload the database to GitHub. But what if you could write a policy that says you can get internet access and database access, but not at the same time? There isn't a path left open for customer data to reach the internet, even if the user needs both as part of their job.</p>
So week 1 has mostly been about getting started. I've started to put together a tech demo, so that I can begin to show the concept and how it would work. I really thought isolating user access was impossible... but the more I think about it, the more I believe in a world where sensitive access can be segmented away. </p>
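A minimal Python sketch of that "both, but not at the same time" rule might look like the following. This is not ShellGuard code. The `SessionPolicy` class and the capability names are hypothetical, and a real policy engine would also have to deal with data already copied into the session before a capability is released, which this toy ignores.

```python
class SessionPolicy:
    """Toy policy: capabilities in the same exclusive group cannot be held together."""

    def __init__(self, exclusive_groups):
        self.exclusive_groups = [frozenset(g) for g in exclusive_groups]
        self.active = set()

    def request(self, capability):
        """Grant the capability unless it conflicts with one already active."""
        wanted = self.active | {capability}
        for group in self.exclusive_groups:
            if len(wanted & group) > 1:
                return False
        self.active.add(capability)
        return True

    def release(self, capability):
        self.active.discard(capability)


policy = SessionPolicy([{"internet", "database"}])

results = [
    policy.request("internet"),  # granted
    policy.request("database"),  # denied while internet is active
]
policy.release("internet")
results.append(policy.request("database"))  # granted after release

print(results)
```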


Outage Stories: When DNS takes out cell phone roaming
2022-05-29T00:00:00+00:00
When I used to work in Telecom networks, on technologies like LTE or UMTS, I was often tasked with investigating and resolving hard problems. A number of these outages were caused by the way Telecom networks use and deploy a DNS architecture separated from the internet. It's the same protocol, but it runs on its own networks with separate root servers and a few oddities you wouldn't expect on the internet.</p>
In this post I'll talk about one of the outages I investigated, where planned maintenance, an unplanned outage, and a misconfiguration led to a triple redundant architecture going down and preventing users roaming into Canada from getting an internet connection.</p>
I've been out of the industry for a number of years now and I'm putting this back together from memory, so there may be some inaccuracy. </p>
Background</a></h2>
At the time, network operators such as the one I worked for generally operated as integrators of proprietary solutions. Think of a company like Cisco, which makes large-scale routers for carrier-grade networks; the same applies to wireless. This creates a few challenges: troubleshooting requires built-in tools, logs, and metrics, or looking at the equipment externally, such as by taking traffic captures. We rarely had the source code or internal knowledge of the products.</p>
When our company originally deployed its 3G wireless network, someone just built a half dozen Linux servers and installed ISC BIND to act as DNS servers. These DNS servers were effectively used for service discovery within the network, and to facilitate roaming between networks from different carriers. Because these servers were just thrown in as part of a network build-out, in a company culture that at the time didn't really manage servers, they immediately went unmaintained. </p>
These unmaintained DNS servers were eventually replaced with a new set of DNS servers that were provided by a vendor as an appliance and had a team assigned to maintain the new system with proper vendor support for hardware and software. </p>
Even though all of this service discovery was built on the DNS protocol, LTE and UMTS used an architecture completely isolated from the internet, with its own set of root servers and a nice set of standards violations which caused a nice set of problems. In our case, the root servers were provided by a roaming exchange, which can be thought of as a company that exists to connect wireless carriers and facilitate how your cell phone can move from one carrier to another.</p>
The Outage</a></h2>
The roaming exchange we were partnered with normally had 3 DNS servers in operation. These servers are sort of a combination of DNS roots and Zone servers in one. In internet terms, it’s as if the root and .com servers were running on the same machines. But these are not the internet root servers, they're something custom just for the wireless telecoms. </p>
The roaming exchange had taken one of their servers down for extended maintenance and then experienced an outage on a second server. And our connectivity to international networks went down, even though there should have been one working server. From our perspective, it was as if the internet root servers all disappeared. Our internal network was fine, just the rest of the world blinked out of existence. </p>
The outage was eventually recovered by the roaming exchange fixing one of the down DNS servers.</p>
The Investigation</a></h2>
For 2 of 3 root servers, the root cause was pretty straightforward and was up to the partner to solve. The problem I was brought in to investigate, is why we weren’t working with the third server which according to the partner was online and working throughout the outage. No other complaints according to them.</p>
The summary of the initial findings was something along the lines of:</p>

There was no network cause for the failure. We could reach the server. </li>
Testing and packet captures showed bi-directional communications:

Normal communications for several seconds after DNS service was restarted on our new DNS resolvers.</li>
Manual queries from the host were working, despite the server itself not trying to communicate with the particular root.</li>
</ul>
</li>
Our old DNS servers were still running and did not experience an outage to the working server.</li>
</ul>
Note: The new and old servers were both based on ISC Bind, but different versions.</p>
Inspecting the DNS caches on the new servers, showed records related to the working root server were missing, while records towards other roots were present.</p>
At the time I likely had an average understanding of the DNS protocols and services, similar to anyone who has had to configure records to set up a website. We engaged our vendor support and began working with someone whose expertise was specifically in the DNS protocol and specifications, and even escalated to basically the guy who wrote the book on DNS. The outcome of that discussion was that the vendor would write us a custom patch, just for us, the exact details of which I don’t remember.</p>
The Sniff Test</a></h2>
The difficulty was that this solution didn’t seem to pass the sniff test. There were too many unexplained components to the problem, that writing a patch may or may not resolve. But if nothing else, it was clear that we hadn’t established at this point what the problem was, and suggesting patches or features was premature.</p>
So I set up several VMs that I could use to simulate every host in the production network. I queried the real servers, and then configured my simulation with matching records. I had effectively recreated the production DNS servers of our company, the roaming exchange, and a roaming partner. I was able to reproduce the problem and began experimenting.</p>
I built 10-20 releases of Bind, covering the versions between our old and new servers until I had isolated a series of patches where the problem began. I went through the release notes and nothing stood out but found that the patch that broke our DNS was more than a year old. It seemed unlikely that we were the only ones to have this problem across a year+ of patches. I ended up setting aside this line of investigation before investigating the source code for each patch.</p>
Clues</a></h2>
Using the simulation I eventually narrowed in on a set of clues:</p>

In the vendor UI, the root servers were simply called something like root server configuration. However, in Bind we configure a “Root Hints”, an initial set of records to locate other DNS servers. </li>
When starting Bind and inspecting the caches, records within the root hints would be present.</li>
Going packet by packet through the startup sequence, there was a discrepancy in the responses to AAAA queries to the root servers for the names of the root servers. The working servers would return a no data result. The non-working root would return a name error response to the AAAA query.

Note: This was an IPv4 only network, so it didn’t seem like focussing on AAAA (IPv6 records) was an important detail at the time. Turns out it was quite important.</li>
</ul>
</li>
After the startup queries were returned, the records for the root server were missing from the cache.</li>
</ul>
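The discrepancy in the third clue is visible in the DNS response header alone. As a rough Python sketch: a name error is RCODE 3, while a no data result is RCODE 0 with an empty answer section. The `classify` and `header` helpers are made up for this post, and the check is simplified, since an empty NOERROR response can also be a referral.

```python
import struct

RCODE_NOERROR = 0
RCODE_NXDOMAIN = 3


def classify(response: bytes) -> str:
    """Classify a raw DNS response using only the 12-byte header."""
    _id, flags, _qd, ancount, _ns, _ar = struct.unpack("!6H", response[:12])
    rcode = flags & 0x000F  # RCODE is the low 4 bits of the flags field
    if rcode == RCODE_NXDOMAIN:
        return "name error"   # the name does not exist for any type
    if rcode == RCODE_NOERROR and ancount == 0:
        return "no data"      # the name exists, but not with this type
    if rcode == RCODE_NOERROR:
        return "answer"
    return "other error"


def header(rcode, ancount):
    """Build a minimal response header for the demonstration (QR bit set)."""
    return struct.pack("!6H", 0x1234, 0x8000 | rcode, 1, ancount, 0, 0)


working_root = classify(header(RCODE_NOERROR, 0))
broken_root = classify(header(RCODE_NXDOMAIN, 0))

print(working_root, broken_root)
```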
I reread the DNS standards a couple of times, top to bottom until the following section of RFC 1034</a> stood out.</p>
When the resolver performs the indicated function, it usually has one of
the following results to pass back to the client:

   - One or more RRs giving the requested data.

     In this case the resolver returns the answer in the
     appropriate format.

   - A name error (NE).

     This happens when the referenced name does not exist.  For
     example, a user may have mistyped a host name.

   - A data not found error.

     This happens when the referenced name exists, but data of the
     appropriate type does not.  For example, a host address
     function applied to a mailbox name would return this error
     since the name exists, but no address RR is present.

It is important to note that the functions for translating between host
names and addresses may combine the "name error" and "data not found"
error conditions into a single type of error return, but the general
function should not.  One reason for this is that applications may ask
first for one type of information about a name followed by a second
request to the same name for some other type of information; if the two
errors are combined, then useless queries may slow the application.
</span></code></pre>
It’s subtle...</p>
Analyzing DNS as a protocol, it’s easy to think of a request and response for a particular record type as an independent operation. From the way we interact with and configure DNS servers, we create records independent of each other. But this particular portion of the standard allows a caching resolver to draw a relationship between different record types based on the error result, specifically whether the name exists for other record types.</p>
So BIND was drawing the inference it was supposed to, the particular name received a name error (NXDomain) response, which meant the name doesn’t exist for any record type. So when performing an IPv6 system query, it should delete all records from its cache that contain the name that doesn’t exist, even those entries that we statically configured in our hints file for the IPv4 address of our root server.</p>
So our IPv4 records for who is root, were getting deleted by system queries that were checking to see if any IPv6 addresses were available for those root servers. </p>
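The effect can be reproduced with a toy cache in Python. This is not BIND's code or data structures, just a sketch of the inference described above. The `ResolverCache` class, the `root-a.example.` and `root-b.example.` names, and the addresses are all placeholders.

```python
class ResolverCache:
    """Toy cache keyed by (name, record type)."""

    def __init__(self):
        self.records = {}

    def load_hints(self, hints):
        """Prepopulate the cache with statically configured IPv4 hints."""
        for name, address in hints.items():
            self.records[(name, "A")] = address

    def handle_response(self, name, rtype, rcode, data=None):
        if rcode == "NXDOMAIN":
            # Name error: the name does not exist for any type, so drop
            # every record for it, including the A record from the hints.
            for key in [k for k in self.records if k[0] == name]:
                del self.records[key]
        elif data is not None:
            self.records[(name, rtype)] = data
        # NOERROR with no data says nothing about other types: keep them.


cache = ResolverCache()
cache.load_hints({"root-a.example.": "192.0.2.1", "root-b.example.": "192.0.2.2"})

# Startup AAAA lookups for the names of the configured roots.
cache.handle_response("root-a.example.", "AAAA", "NOERROR")   # no data
cache.handle_response("root-b.example.", "AAAA", "NXDOMAIN")  # name error

remaining = sorted(name for name, _ in cache.records)
print(remaining)
```

After the two startup queries, only the root that answered with no data is still usable.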
Root Cause</a></h2>
With this breakthrough, I was able to document what was happening:</p>

During startup/reset, the DNS server would load the root hints and prepopulate the cache to use those roots.</li>
During startup, it would also see it only has a bunch of IPv4 roots, so would start querying the roots for the equivalent IPv6 records.</li>
For one of the upstream servers, the response for IPv6 records would return NXDomain, which we now know means the name doesn't exist for any record type, so the matching IPv4 records would be cleared from the cache. With those records missing from the cache, that particular server couldn't be used as a root.</li>
There is one detail I’m not sure I remember accurately, which is how the servers were responding to the root system query and the interaction with these statically configured records. </li>
Our new servers had never worked with this one root server. It was missed during commissioning that the root server was not being used, and it had never worked since the new servers were added.</li>
The underlying BIND implementation didn’t always implement this behaviour. Buried in a patch somewhere is an update to follow the standard more closely; it’s plausible a bug was just noticed and fixed.</li>
We had the wrong server name on both the old and new servers, only the new servers had the relevant patches. If we had kept the old servers routinely patched, they would have had the same problem.</li>
One of the server names we had for the roots was wrong, this is why we were getting an NXDomain response on the AAAA query instead of a no records response.</li>
</ol>
If the 3 roaming exchange servers were called:</p>
dns1.grx.gsma.
dns2.grx-partner.gsma.
dns3.grx-partner.gsma.
</span></code></pre>
Our configuration was:</p>
dns1.grx.gsma.
dns2.grx.gsma.
dns3.grx-partner.gsma.
</span></code></pre>
As I recall, the way this problem happened was that the roaming partner was rebuilding their DNS servers one by one. The rebuilt servers were using a new naming scheme, but they had gotten ahead of themselves and the datasheet used to exchange information between companies was out of date or mismatched, so we still had an old name for a new server with the same IP address. </p>
This allowed us to draw one more conclusion, we were the only carrier partnered with this roaming exchange using BIND that had updated our installation in the previous year, because we were the only ones who reported going down when the other two root servers were down. </p>
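In hindsight, a commissioning check as simple as comparing the configured hint names against the names the exchange actually publishes would have flagged this. A minimal Python sketch, using the illustrative names from above (the `audit_hints` function is hypothetical, and in practice the published list would have to come from the exchange's datasheet or from querying the roots):

```python
def audit_hints(configured, published):
    """Compare configured root hint names against the operator's published names."""
    configured, published = set(configured), set(published)
    return {
        "unknown_names": sorted(configured - published),
        "missing_names": sorted(published - configured),
    }


published = ["dns1.grx.gsma.", "dns2.grx-partner.gsma.", "dns3.grx-partner.gsma."]
configured = ["dns1.grx.gsma.", "dns2.grx.gsma.", "dns3.grx-partner.gsma."]

report = audit_hints(configured, published)
print(report)
```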


Bitcoin isn't for everyone
2022-05-14T00:00:00+00:00
Bitcoin can be a polarizing topic. There is certainly a strong following of Bitcoin and derivatives, but many of the engineers I know are opposed to the technology. </p>
What motivated me to finally get this post out, however, is that the other week I encountered a non-engineer, non-technical advocate for Bitcoin. They were a restaurant worker, and explained they have a side gig of helping others get into Bitcoin. And we had a bit of a discussion on why I don't believe in Bitcoin, or derivatives, as a form of asset, payment, etc.</p>
And it comes down to, sitting at a restaurant, look around the room, who in that room can effectively and safely use Bitcoin. </p>
Who can safely use Bitcoin?</a></h1>
Almost no one.</p>
In the tech industry, working with very technical folks, it's easy to forget most people are not tech workers. Computers are a tool to many, but they don't understand how they work. </p>
I had a conversation with my sister once, where I tried to explain just what a blockchain is. And she didn't get it. Not that knowing how it works is necessary. The problem with Bitcoin is that someone needs to handle some secrets pretty securely to not lose all of their money. Look around the room, and identify who can really employ the personal security required to effectively use cold wallets, verify destination addresses, avoid storing their secret incorrectly, back up the secrets in case of a hard drive failure, etc. All techniques required to ensure their money doesn't disappear.</p>
So I think Bitcoin has an unstated premise, that users need a high degree of expertise in best practices to even use the technology.</p>
Most people don't have the time for this, or the technical basis to even learn it.</p>
What bitcoin doesn't do?</a></h1>
Living in a stable western democracy, I think there is a bit of a flaw in the concept of Bitcoin. In our techno sphere where we're used to controlling computers through our code, we forget that our current money system has lots of checks and balances in place, and banking over decades has built many rules to accommodate exceptions, many of which we as individuals have never had to deal with in our lives.</p>
So I like to think of problems in the sense of properties:</p>
Mistakes</a></h2>
Mistakes will happen. Many are caused by user input failures, ignoring checks and balances, software bugs, rushing, or even hardware failures that corrupt memory. The current banking system doesn't always get this right, and it may not be instant, but in many cases banking failures can be corrected.</p>
Bitcoin, as a cryptographic ledger, immortalizes all mistakes. The only reversal available in the theory of blockchains is to have the destination run a second transaction to return the amount. Sure, developers have gotten together to tamper with some of the chains to reverse problems before, but no one is fixing the $1000 sent to an address that doesn't exist.</p>
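To make the "address that doesn't exist" point concrete, here is a minimal sketch of the checksum built into legacy Base58Check addresses. It assumes the legacy address format only, and the function names are made up for illustration. The check catches most typos, but an address can pass it and still have no one holding its key, in which case nothing brings the money back.

```python
import hashlib

BASE58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def base58_decode(text: str) -> bytes:
    """Decode a Base58 string to bytes, keeping leading zero bytes."""
    number = 0
    for char in text:
        # str.index raises ValueError for characters outside the alphabet.
        number = number * 58 + BASE58_ALPHABET.index(char)
    body = number.to_bytes((number.bit_length() + 7) // 8, "big")
    # Each leading '1' character stands for one leading zero byte.
    leading_zeros = len(text) - len(text.lstrip("1"))
    return b"\x00" * leading_zeros + body


def looks_like_valid_address(address: str) -> bool:
    """Check only the format and checksum, not whether anyone owns the address."""
    try:
        raw = base58_decode(address)
    except ValueError:
        return False
    if len(raw) != 25:  # 1 version byte + 20 hash bytes + 4 checksum bytes
        return False
    payload, checksum = raw[:-4], raw[-4:]
    expected = hashlib.sha256(hashlib.sha256(payload).digest()).digest()[:4]
    return checksum == expected


# The widely published address from Bitcoin's first block.
GENESIS_ADDRESS = "1A1zP1eP5QGefi2DMPTfTL5SLmv7DivfNa"
# The same address with its last character mistyped.
TYPO_ADDRESS = GENESIS_ADDRESS[:-1] + "b"

print(looks_like_valid_address(GENESIS_ADDRESS))
print(looks_like_valid_address(TYPO_ADDRESS))
```

A passing check only says the string is well formed. It says nothing about whether the recipient is the one you intended, which is the part a bank can sometimes correct after the fact.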
Power of attorney / Estate</a></h2>
It's easy to forget that at some point in our lives our assets may need to be turned over to others. Whether it's a death, and turning over the estate, or some other condition that leads to or requires power of attorney.</p>
I'm sure the Bitcoin advocates can point to failures of this system, or unjust conservatorship that's been applied. But on average, this is a system that applies to millions of people, and it doesn't really have a Bitcoin equivalent. Recovery from sudden death means a pre-planned method of preserving and turning over secrets, which not many will do. Then there is the additional operational security of only having the secrets turned over when required, and of making sure the storage of those secrets doesn't become the weak link used to steal all of someone's "money".</p>
And something as simple as rotating your secrets now requires many additional steps.</p>
Court orders</a></h2>
So you went to court and got a judgement. Maybe it's big, maybe it's small. And the opponent doesn't pay what is ordered by the court. If the money is sitting in a bank, the court has an option for enforcement; Bitcoin removes that option. If the judgement is large and the party simply decides not to pay, are you going to garnish wages for the next 20 years, when the money you're owed is just sitting there?</p>
Hell, in these types of conflicts someone could, in a fit of deviancy, decide to just destroy the amount by transferring it to nowhere instead of complying with the judgement, and then good luck.</p>
I'm just waiting for the first court cases to explore blockchains and orders for transfers.</p>
Legal Enforcement</a></h2>
I realize for many this might be a feature, but I think of it as a bug. For the most part I agree with law enforcement holds and government sanctions, as long as there is a strong judicial process. Can someone find a sanction I disagree with? Of course. But I also live in a representative society where I'm not going to agree with every law enforcement or government decision, and that's fine.</p>
And there are real problems with some of these systems. Asset forfeiture is rampant, but do I really believe that using a digital wallet is going to stop a government from stealing my stuff? We need to demand better of our institutions where they fail, not work around them and open the floodgates for criminal behaviour.</p>
For those who don't have strong institutions</a></h1>
I know Bitcoin is having some success where the commerce of the state is failing, for various reasons. I don't have a good answer for that use case. But I don't believe in the long run, that Bitcoin successfully replaces the institutions of banking, legal recourse, and security for those who don't have those institutions.</p>
So there are very real problems to be solved here, and I understand why Bitcoin does get involved, but I think we should demand more here, not try to duct tape a severely flawed solution that forgets what purposes institutions serve in our lives. </p>
Summary</a></h1>
I'm sure there are some who will sit around and say all of these are features: the times the courts fail, government mistakes, and illegitimate power of attorney could all be avoided with Bitcoin, and no one should be able to force you to pay a penny you don't want to.</p>
But in my opinion, Bitcoin simply isn't a technology the vast majority of the populace can use safely, or to achieve goals they can easily do with the current banking system.</p>


Comments are for why, not what
2022-04-30T00:00:00+00:00
I suspect too many dev teams have gone a bit too far with the mantra that code should be self-documenting. When interviewing developers, poor commenting style is something I often pick up as a negative signal.</p>
So this is just a reminder, that the code tells you what and how, and should be as clearly written as possible for other developers to follow using just the code. But this mantra doesn't mean there should be no comments... comments are still important to indicate to the reader why a piece of code exists.</p>
If a piece of code exists to implement a customer requirement, link to that requirement or an external source of information, or simply explain the problem being solved. Otherwise, those who come after you, including yourself, will be doomed to repeat a bug, lose a requirement, or be stuck when changing code later.</p>
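As a small illustration, here is a sketch of the difference. The function, the billing API, and the requirement id REQ-123 are all hypothetical, made up for this example. The first comment only repeats what the code already says; the second records why the code exists.

```python
MAX_DELAY_SECONDS = 30.0


def retry_delay_seconds(attempt: int) -> float:
    # A "what" comment, which adds nothing the code doesn't already say:
    # double the delay on each attempt and cap it at MAX_DELAY_SECONDS.
    #
    # A "why" comment, which the code cannot tell you:
    # the (hypothetical) upstream billing API rate-limits clients that retry
    # more than once per second, so we start at one second and back off.
    # The cap comes from REQ-123 (a made-up requirement id), which says a
    # user must never wait longer than 30 seconds between retries.
    return min(1.0 * 2 ** attempt, MAX_DELAY_SECONDS)
```

If someone later wants to drop the cap or shorten the first delay, the second comment tells them which requirement and which external behaviour they need to check first.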
See Code Tells You How, Comments Tell You Why - Coding Horror</a> for a better explanation.</p>


We don't know what we don't know
2020-07-16T00:00:00+00:00
Sometimes it's important to remind ourselves that we don't know what we don't know!</p>
I recently began to spend more effort considering and emphasizing what pieces might be missing from a particular engineering puzzle. Like many engineers, I rely heavily on mental models when tackling technical problems. The problem is, these mental models are seldom complete, with gaps or assumptions, which leads to mistakes or discounting approaches to a question we didn't know to consider.</p>
Let's spend some time considering what it means not to know what we don't know and how we can remind ourselves of some of our own biases.</p>
My first experience with WDKWWDK</a></h2>
For me, this realization first came during a meeting.</p>
We were trying to tackle a specific security challenge that required selecting a cryptographic protocol with particular properties. I was in the position to make the recommendations, which came down to Noise</a> based on some industry experts'</a> suggestions.</p>
The problem we had is we weren't able to locate an implementation of the protocol for our embedded CPU. While I was evolving into a security lead for the team with a better than average understanding of security principles, I was utterly ill-prepared to write a crypto implementation that would pass any scrutiny.</p>
However, since I had the most knowledge, the team asked me to develop the implementation.</p>
I objected, indicating that writing a cryptographic implementation is detailed work, and it's challenging to be sure that the implementation is correct.</p>
The team countered, saying that it must be correct if the implementation can communicate with another library (on a known CPU architecture).</p>
The problem is it's possible to write a compatible protocol implementation that has significant security problems. Re-using nonces or timing attacks could ruin our day, and we'd easily miss out on the security properties we desired.</p>
The team countered again, indicating that if I knew about problems like nonce re-use or timing attacks, I should be able to avoid those problems.</p>
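To show how an implementation can interoperate and still be broken, here is a toy sketch. It is not Noise and not a secure cipher: the SHA-256 based keystream and every name below are assumptions made up for illustration. Decryption round-trips correctly, so a compatibility test passes, yet re-using a nonce leaks the XOR of two plaintexts. The last two functions show the timing issue: both return the same answers, and only their running time differs.

```python
import hashlib
import hmac


def keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Toy keystream for illustration only: SHA-256 over key, nonce, counter."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:length]


def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def encrypt(key: bytes, nonce: bytes, plaintext: bytes) -> bytes:
    # XOR stream cipher: encrypting and decrypting are the same operation.
    return xor_bytes(plaintext, keystream(key, nonce, len(plaintext)))


key = b"k" * 32
nonce = b"n" * 12  # the bug: the same nonce is used for both messages
message_1 = b"attack at dawn!!"
message_2 = b"retreat at noon!"

cipher_1 = encrypt(key, nonce, message_1)
cipher_2 = encrypt(key, nonce, message_2)

# The interoperability check passes: the peer can decrypt what we sent.
round_trip_ok = encrypt(key, nonce, cipher_1) == message_1

# An eavesdropper without the key still learns message_1 XOR message_2,
# and anyone who can guess one message recovers the other.
leaked = xor_bytes(cipher_1, cipher_2)
recovered_2 = xor_bytes(leaked, message_1)


def tags_match_naive(expected: bytes, received: bytes) -> bool:
    # Ordinary equality may stop at the first differing byte, so how long it
    # takes can reveal how much of a forged tag was correct.
    return expected == received


def tags_match(expected: bytes, received: bytes) -> bool:
    # hmac.compare_digest is designed to avoid that content-based short circuit.
    return hmac.compare_digest(expected, received)
```

Nothing in a compatibility test distinguishes the safe comparison from the unsafe one, or a fresh nonce from a repeated one, which is why "it talks to the other library" is not evidence of a correct implementation.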
Any problem I can describe in theory should be a problem I can avoid.</p>
Eventually, I had to figure out how to approach this engineering conversation as an admission: I don't know what I don't know, so I'm ill-prepared to solve problems I don't even know to avoid.</p>
Not only did I need to admit what I didn't know, but as a team, we had to admit together that we didn't know what we didn't know. In many scenarios, we may be able to get away with this, but here, a failure meant we failed to deliver on our security promises to our customers.</p>
Is this the Dunning-Kruger effect?</a></h2>
For those who aren't familiar, a simplified version of the Dunning-Kruger effect states that a cognitive bias exists where individuals with limited knowledge or competence in a given intellectual domain greatly overestimate their expertise or skill in that domain, while those with a great deal of expertise tend to underestimate their results.</p>
This cognitive bias is fascinating and may explain some of the "solutions" or "projects" we see that claim perfect security while not holding up to basic scrutiny.</p>
While the Dunning-Kruger effect may apply here, we often forget that this cognitive bias only means that people scoring in the 12th percentile had ranked themselves in the 62nd percentile, while those in the 90th percentile had self-assessed to be within the ~75th percentile.</p>

See the Dunning-Kruger effect on ResearchGate for more information</a>


</p>
With this in mind, on average, individuals without much exposure do not rank themselves as experts in an area they don't know.</p>
We can still draw an important lesson here; that we need to be aware of and careful of our biases, whether it be the Dunning-Kruger effect or something else.</p>
Unconscious Fear</a></h2>

xkcd 1053</a></p>
As engineers, I think we get used to analyzing complex problems and coming back with answers. Especially those of us who are known for digging in and finding the solutions.</p>
When given a task, there may also be a particular fear associated with not having the answer. If your leadership or team members expect you are an expert in a domain and are assigned a problem, it feels like a failure to return without the full solution. </p>
For me, it's even come to the point where others have tried to shame me in front of my peers for not knowing some technical detail offhand. This behaviour creates perverse incentives, where it's better to stick with an assumption than admit we don't know something. I know I've unconsciously affected others in this same way, which is one of my regrets.</p>
A large part of this is realizing it's not a failure to admit that we have missing knowledge. We may not remember some details or have more research to do. Maybe there is no action required. The failure is to proceed without identifying the risk of the missing information. Whether an expert or not, there are always going to be missing details. When working in security, the missed detail could be the difference between high and abysmal security. In other domains, the risks may be much different.</p>
Knowing the risks of missing information allows us to choose whether to accept those risks. </p>
How do I apply this?</a></h2>
I think it's as simple as stating, "We don't know what we don't know" in a conversation, or typing WDKWWDK into a document or chat session. And ensuring our teams and leaders don't allow a stigma to exist around not knowing some details.</p>
I'm not the first</a> to use this acronym, but I also don't believe we use it enough.</p>
This statement of fact acts as a reminder that we should step back, consider how our biases may impact a particular decision or design, and identify what information may be missing.</p>
We can then consider:</p>

What we know</li>
What we know we don't know</li>
Whether there is anything we don't know we don't know that needs exploration</li>
</ul>
And decide on the risks of proceeding with missing information, cognizant that there will always be missing information.</p>
Summary</a></h2>
Before committing to an approach to an engineering problem, remind yourself or the team that there are likely missing details that may affect the outcome. How to proceed largely depends on the risks involved; topics like security have different risks than the CI/CD pipeline.</p>
And we need to remind ourselves and our peers to avoid the stigmas and biases introduced by admitting we don't know things. Being an expert in a domain like security doesn't mean memorization of every algorithm, every protocol, or every approach.</p>