2020-11-16 | Virus | Collapse of All
I was hired five months ago to work at a company based on my approach to systems analysis. Here is a short version of what that means: I have been desperately, obsessively trying to simplify the typical enterprise architecture (EA) analysis that shops with mature IT processes use, and put it in a form that everybody can use ... well, anybody that has ever written a tool using a scripting language. I'm using tech that all of the big cloud companies are using to run their stuff at root. They necessarily are relying on the work of others, and I know what that work is, where it comes from, how to use it, and how to package it up in my own OS. Not only have I shown that this works, but I have traced the main intellectual effort behind some of these ideas. I can tell you where many of the main players landed. I put up a demonstration site last year, and the CTO of the company saw it, said we needed that kind of analysis, and they hired me.
After one week on the job, I realized that they had no idea how their infrastructure was really running, and it wasn't monitored, so I volunteered to postpone any data flow analysis, and used my modeling ideas to generate a vendor engagement design for monitoring. If you are in the computer world, you know that monitoring is somewhat unique, in that it spans all areas of compute, applications, storage, and networking. There are advantages to being able to correlate failure across these areas (failures are generally event based... on a timeline). Additionally, performance metrics are useful both for correlation and for planning. If you work these areas separately, without any idea of how you are going to standardize the data or integrate it for correlation/planning, you lose much of the benefit.
Another problem with monitoring is that every group has their pet needs. If you just put something in right away, there is a constant flow of tasks related to a must-have need. At the same time, I'm aware that any form of analysis quickly ends up being labelled analysis paralysis. I figured that I would solve this by presenting a "simple" what and how approach to the monitoring requirements listed by four classifications. "What" is stuff like network interface state, hard disk able to write and read data, or CPU load. "How" is stuff like servicing alerts, correlating errors, and mapping dependencies, so that if rack power goes out that fact isn't buried beneath the 100,000 alerts behind that rack power dependency.
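The rack power case can be sketched in a few lines of Python. This is only an illustration of the idea, not what any monitoring product does internally: the item names and the `DEPENDS_ON` map are made up, and the rule is the simplest one I can think of (an alert is a likely root cause only if nothing it depends on is also alerting).

```python
# Hypothetical dependency map: each item lists what it directly depends on.
DEPENDS_ON = {
    "rack1-power": [],
    "switch1": ["rack1-power"],
    "server1": ["rack1-power", "switch1"],
    "server2": ["rack1-power", "switch1"],
    "db-service": ["server1"],
    "web-service": ["server2", "db-service"],
    "server9": [],  # lives in a different rack, unrelated to rack1
}


def upstream(item, depends_on):
    """Return every direct and transitive dependency of item."""
    seen = set()
    stack = list(depends_on.get(item, []))
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(depends_on.get(dep, []))
    return seen


def triage(alerting, depends_on):
    """Split alerting items into likely root causes and alerts they explain.

    Returns (roots, suppressed) where suppressed maps each downstream
    alert to the root causes found among its dependencies.
    """
    alerting = set(alerting)
    roots = sorted(
        a for a in alerting if not (upstream(a, depends_on) & alerting)
    )
    suppressed = {
        a: [r for r in roots if r in upstream(a, depends_on)]
        for a in sorted(alerting)
        if a not in roots
    }
    return roots, suppressed


if __name__ == "__main__":
    alerts = ["web-service", "db-service", "server1", "server2",
              "switch1", "rack1-power", "server9"]
    roots, suppressed = triage(alerts, DEPENDS_ON)
    print("look at these first:", roots)
    for alert, causes in suppressed.items():
        print(f"  {alert} is explained by {causes}")
```

With seven alerts firing, only `rack1-power` and `server9` come back as things to look at first; the other five are filed under the rack power failure instead of burying it.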
I knew they didn't have the crew to build all the monitors, so I wanted to make sure we found something that covered all of this. I did not fully investigate every area, to simplify. For instance, I'd simplify stuff like hard drive monitoring or OS monitoring or service state, assuming that any monitoring package that monitored that class of item was good enough at the level of money we were considering. As for the four steps for ranking, I've been in many efforts where somebody suggests "what about this?" and they or somebody else will bring it up at every. single. meeting., but nobody can ever come up with a good justification for it. It is just a game people play with new systems or any kind of analysis, really. I've done it myself on the other side. Sometimes it yields goodness. Still, it is useful to capture it as in "meh you have to do some work if you want this as a consideration, don't bring it up again, it is here". Call that (4). The other three were: 1 - we want to buy this and it needs to be turned on by end of the year. 2 - we want to buy this, but we will get to dealing with it later... next year sometime, and 3 - we want to be extensible to this at some time in the future but we don't want to buy it right now. It will just slow things down.
I got consensus from all stakeholders, and mapped to what, how, and priority levels. By all stakeholders, with a team so small, that meant everybody on the IT team was included, because they worked directly with the monitored items. This took a couple of months, but considering the fact that they had never successfully implemented monitoring in the past, it seemed like a win to me. Right about this time, the CTO told another outside consultant to just turn on some monitoring that he claimed we already owned, which nobody had shared with me or even knew about. I wanted to be a team player, so I pivoted to: OK, well, since we have this map of what we want, what are the gaps between this thing you just turned on and what we agreed on just last week was what we wanted?
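For the curious, the map and the gap question fit in a very small data structure. A sketch in Python, with loud caveats: the requirement names come from my examples above, but the priority assigned to each one here is invented for illustration, and `product_capabilities` is a stand-in for whatever the turned-on product actually covers.

```python
from dataclasses import dataclass
from enum import IntEnum


class Priority(IntEnum):
    BUY_AND_ENABLE_THIS_YEAR = 1   # buy it, turned on by end of the year
    BUY_NOW_ENABLE_LATER = 2       # buy it, deal with it next year sometime
    STAY_EXTENSIBLE = 3            # don't buy now, don't paint into a corner
    NEEDS_JUSTIFICATION = 4        # captured; do the work before raising it again


@dataclass(frozen=True)
class Requirement:
    kind: str          # "what" or "how"
    name: str
    priority: Priority


# Illustrative entries only; the priorities are assumptions, not the real map.
REQUIREMENTS = [
    Requirement("what", "network interface state", Priority.BUY_AND_ENABLE_THIS_YEAR),
    Requirement("what", "disk read/write", Priority.BUY_AND_ENABLE_THIS_YEAR),
    Requirement("what", "cpu load", Priority.BUY_AND_ENABLE_THIS_YEAR),
    Requirement("how", "alert servicing", Priority.BUY_AND_ENABLE_THIS_YEAR),
    Requirement("how", "error correlation", Priority.BUY_NOW_ENABLE_LATER),
    Requirement("how", "dependency mapping", Priority.STAY_EXTENSIBLE),
]


def gaps(requirements, product_capabilities,
         max_priority=Priority.BUY_NOW_ENABLE_LATER):
    """Return agreed requirements, up to max_priority, the product lacks."""
    covered = set(product_capabilities)
    wanted = sorted(requirements, key=lambda r: (r.priority, r.name))
    return [
        r for r in wanted
        if r.priority <= max_priority and r.name not in covered
    ]


if __name__ == "__main__":
    turned_on = {"network interface state", "cpu load"}
    for req in gaps(REQUIREMENTS, turned_on):
        print(f"P{int(req.priority)} {req.kind}: {req.name}")
```

The point of the default `max_priority` is that only the things we agreed to buy count as gaps; the "stay extensible" and "needs justification" items stay on the map without blocking anything.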
All shut down. No response for several weeks. I worked on my OS and MCJ on my own time, waiting, and tortured my friends with emails and posts of tech I was fascinated with and building. Finally, I ended up working directly with two engineers who worked for the vendor of the monitoring solution the consultant said we owned, to figure out the gaps. While it was true that the solution could actually do what we needed, it wasn't true that we owned it, so I brought back a bid of exactly what we would need to purchase to fit our requirements. Further, in a recorded video with the engineers, I demonstrated all of my higher level ideas (not that novel) around correlation, detecting anomalies, and capacity planning on the exact product. The engineers devoured my work product of the monitoring map. We created something good and useful.
By this time, the consultant had moved on to yet another idea. Still, I had a good set of requirements, and a video demonstrating the more sophisticated ideas live on a product that I had pricing on. All that needed to happen was for management (multiple levels, as I have two bosses, one of them reports to the other, and the CTO is above that) to digest this information and decide to purchase. All was shut down again. I'm not getting any money. This went on for a month. I worked on my geek personal projects and tortured my friends more with MCJ.
Both the vendor and I prodded management for status, but nothing came back. Late last week I met with my boss, and the conclusion was that my approach was too complicated, and likely we would just toss any solution out after two years anyway. So here is an error on my part. I assumed that the solution would last longer than two years. I was transparent in my requirements analysis with my boss(es) at every step, so they certainly share some blame for this. We were considering products that cost 100,000 first year and 70,000 recurring, so the idea that we would just put something in over 3-6 months to get fully operational across all IT (compute, storage, network, etc.), run it for a year, and then buy a different one was so far out of my world view that I missed it. In this new world, remember that, folks. There is a lesson here. Ask that question. This blindsided me.
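To put rough numbers on why the two-year horizon surprised me, here is a back-of-the-envelope sketch in Python. The 100,000 and 70,000 figures are the ones above. The six-month ramp is the top of my 3-6 month estimate, and the five-year comparison is just my own assumption about how long a system like this would normally stay in place.

```python
def total_cost(years, first_year=100_000, recurring=70_000):
    """Total spend over a whole number of years of ownership."""
    if years < 1:
        raise ValueError("years must be at least 1")
    return first_year + recurring * (years - 1)


def cost_per_operational_month(years, ramp_months=6, **prices):
    """Spend divided by the months the system is fully operational.

    ramp_months is the assumed time to get fully operational across all IT.
    """
    operational_months = years * 12 - ramp_months
    if operational_months <= 0:
        raise ValueError("system is replaced before it is operational")
    return total_cost(years, **prices) / operational_months


if __name__ == "__main__":
    for years in (2, 5):
        total = total_cost(years)
        monthly = cost_per_operational_month(years)
        print(f"{years} years: total {total:,}, "
              f"about {monthly:,.0f} per operational month")
```

Tossing the system after two years means paying 170,000 for roughly 18 months of full operation, versus 380,000 for roughly 54 months if it stays five years. That works out to about a third more per useful month, before counting the labor of doing the whole rollout again.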
A secondary problem, which my boss admitted, was that they didn't have the capacity to make a decision based on what I had provided. Note that my boss has reviewed and agreed with all that I had provided prior... the issue is that from an organizational perspective they are really only able to kick the can down the road a week at a time. In that same meeting with my boss, we decided that I would focus on just the networking piece, as that was a more immediate need (much is currently not monitored at all). I agreed. Not wanting to repeat my same mistakes, I asked what currently existed (note I did ask this above, and even interviewed the stakeholders on as-is... in the end I didn't miss anything because the consultant was outright wrong). I found out that we did have a product, and there were some reports coming out, and we were monitoring, but nobody knew how to use it. I suggested that I dig in and figure out why it wasn't working for people, first, and that would lead to the requirements for free. If I could figure it out, great, we had it, and it was working. If not, we could purchase something that fit the need that wasn't filled. I'm shooting really low now, based on my previous experience.
My boss went to the CTO and the CTO vetoed the whole thing and said we had tried that in the past with consultants and got nothing, so we needed to abandon the current system. Further, that any monitoring system we purchased should provide what we need, so we didn't need to do much analysis. (Somehow, this is not in conflict with the fact that they purchased a system that nobody can use for some reason, and they are going to toss it. I am confused by this.) The conclusion is that I will put in a system for networking after establishing the most basic requirements like capacity. The punchline is that I likely have another two months of consulting serving as a technical lead implementing the monitoring system.
Now, you might think that I'm picking on this particular org. There are some obvious failures here. But it is my experience that this is how stuff like this happens at most places. We have generally become incapable of forward motion of our own volition towards an agreed-on goal. We need to break up our efforts into smaller concerns, regardless of whether it fits our broader goals (like prediction, detecting anomalies and correlation). The takeaway on this, since I can do stuff like build an OS from scratch and network it, is that I can continue on contract to put this in and have income during the pandemic crisis.
Now for some tangents. If knowledge built civilization, and that knowledge primarily started with written language, then what does collapse imply? It is likely that this is correlated. That is, if we put in systems on a five year timeline, the rules hold. But what if things change so fast and the system grows in complication so fast that we can't design? True, this was my motivation for agile analysis; however, the problem is more with the capacity of people to actually use and create knowledge towards specific goals. This is illustrated by my somewhat absurd example above. I often harp about the idea that as we cede this stuff to corporations, we are ceding goals (their goal is profit, not necessarily our goals as a species or org or individual).
Even as I put together my OS, there are some items that I can't completely control. Both Rust and some of the Python frameworks require network connectivity to pull their components. That is probably too far out on a tangent to bring in, but in my mind I see the ecosystems of software and operating systems mapping with the same general problem with knowledge and complex systems.
The conclusion, always, for me, is that we are in the middle of collapse, and all is explained by this. Systems are too complex for individuals to own. Our intellectual capacity is hijacked by images and BS (like Huxley warned about). We have ceded this to large corporations. Frankly, I'll protest the oligarchy on this, but somebody has to build and operate new things, so they get it.
The same thing goes for political discussion. The level of intellectual sophistication has gone to close to zero. We use three word statements to express our political stance. We are in collapse, collapse of civilization, and collapse of knowledge, at least knowledge built and maintained widely by humans without AI. BUT, there likely will be a persistent thread of humanity and knowledge that will continue to operate. Here is a weird idea. The thread that is operating is more and more AI. Now, it is AI with a goal of profit, but what happens if and when the AI plans further out? We need to protect consumers. True, at this phase it might be keeping consumers using Zoom and staying at home and buying our products (destroying all traditional human economies), watching Netflix and posting the current identity passion.