Your team has actively hindered troubleshooting this problem for a long time. The obvious goal here is to get an exclusion by claiming that the product doesn’t work. I’ve seen better acts over the course of my career, so this one is easy to see through. So let’s recap it.
Two years ago, during initial rollout, there was an issue. You reported it to the team at the time, who told you they weren’t sure of the cause and asked you to roll it back until we could get back to you, which you did. In the fall of ’21 someone finally noticed that you were non-compliant. After much moaning, you finally came back to work on the issue. You deployed to your staging environment with no issues at all. Great.
We moved forward to a small deployment in production, which went well for a short time and then had issues again. You whined about it a lot while making sure not to provide helpful information. The issue was escalated to the vendor, who told you which logs were required and how to gather them. Instead, you sent kernel logs that were effectively worthless. You knew they would be useless but did this to keep your whining alive.
After more bitching and moaning, and delaying, you reinstalled the product, and you got a sample of the requested logs. But you did so from a system that was not currently an active node and wasn’t experiencing an issue. That limited their usefulness, as you knew it would.
At this time, you made a big deal out of bitching that your system can’t survive a single millisecond of delay, which we all know is bullshit since network latency to end systems is much more variable than that. But allowing for the sentiment to keep things rolling, we moved forward. The vendor provided two potential fixes that may resolve the issue: upgrading the product to a newer version that doesn’t have a known issue with scripts that contain large numbers of piped commands, or creating an exclusion for /opt/yourapphere/. You have tossed both suggestions out because you don’t understand the reasoning. I don’t expect you to, since you’re not experts, but maybe you should try listening to the experts in the future. Or, you know, letting us explain at all.
The first solution would resolve the issue if the source is in the script module, where your system uses too many piped commands. There was a bug in that module, and it was resolved in the newer version. This is the less likely cause.
The second solution assumes that your overloaded system is so messy that just logging activity from it is putting it over the limit. The product has internal throttling that can kick in if it’s monitoring a bullshit large number of threads. If we exclude /opt/yourapphere/, that may keep the process nonsense of your badly designed and highly overloaded system from crashing.
But you don’t want to hear that. You want to cry and bitch and moan in an attempt to get us to give up and go away. Because your goal here isn’t to fix things. It’s to make it out like you can’t possibly be compliant so we’ll go away.
It’s bullshit. We know it is. You know it is. And we should stop pretending that it’s not.