Is there anything worse than a crashed Endpoint?

STEP is likely to be a core system for any organisation that has invested the time and money to implement it. Consequently, large quantities of data will be going in or out at various times, and in many cases complex Business Rules must be adhered to or executed. The last thing the Business needs is one or more of these processes failing. So why do integrations or event processors crash? And what happens when they do?

At many of the STEP clients I’ve worked with, there have been issues (some minor, some major) with Integration Endpoints and Event Processors crashing, and consequently the flow of data or the execution of crucial Business Rules ceasing. While these Endpoints are down, the queue of events waiting to be processed can only grow. These process failures can have a myriad of causes, but for the purposes of this article I will simplify them into two groups: causes within our (STEP system admin) control and causes outside our control.

In my experience, causes within our control generally stem from exceptions thrown during the execution of a Business Rule. These can come from (among other things) a bug in some JavaScript code, alterations to underlying configuration that make a Business Rule invalid (e.g. removing an LOV value that is set as part of a Business Rule), or an attempt to run some logic on an object that has been deleted.

Although there are many interesting things to discuss here, I’m going to gloss over issues that are our fault (note, I’m not dodging scrutiny here, merely delaying it until a later post) and talk more about the second group: causes that are OUTSIDE of our control. Again, there can be a myriad of reasons for this, but the specific situation that brought this article to mind was a problem we faced relating to JMS (Java Message Service) queues in the Enterprise Service Bus (ESB). Relatively frequently, our STEP system was unable to connect to the JMS queues at the ESB, resulting in the Inbound or Outbound Integration Endpoint (IIEP / OIEP) failing. Not infrequently, this would happen at the worst possible time (6:30 pm on Friday, just before all the scheduled mass updates are processed).

The consequence of this issue for our Business Users was not catastrophic, but it would often mean that when they came back to work on Monday morning the mass data updates had not been processed, and would not be processed until a member of the team went into STEP to re-enable the failed endpoints. This delayed processing left the Business with a sluggish system as it tried to work through the weekend mass updates in addition to the BAU tasks being done by the Users. In some circumstances it led to confusion as the data took time to flow through and synchronise with other systems in the landscape. The general result was a feeling of frustration, annoyance and sometimes anger when users realised they might have to do some work a second time as a result of the delayed processes.

Some organisations with a savvy and proactive IT infrastructure team may be lucky enough to have some protections (e.g. small applications monitoring network connectivity) that allow these failure events to be recognised and corrected. Some applications may even use the STEP REST API to do this, and subsequently (spoiler alert) be able to re-enable the endpoints by sending an HTTP request to this API. At the Client where we had this problem, we had no such protection, and even the ESB team were struggling to support us with these events (there was always something with a higher priority). It got to the point where the frequency of the failures, and their negative impact on the reputation of the system and our team as a whole, was so great that we decided to do something about it. Bottom-up change for the win!

Once I had the go-ahead from the Team Lead, I set about designing some configuration in STEP to allow us to monitor and re-enable failed processes.

The requirements were as follows:

  • Process runs autonomously without need for manual attention
  • Can be turned on and off per Endpoint
  • Frequency of monitoring can be configured
  • Quantity of consecutive attempts to restart can be configured
  • Notifications can be sent, and the recipients of these notifications can be configured
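
To make the requirements concrete, here is a minimal sketch of how they might map onto a per-Endpoint configuration record. It is plain JavaScript for illustration only, not STEP Business Rule code, and every name and default value in it (`createMonitorConfig`, `isCheckDue`, `checkEveryMinutes`, the 15 minute and 3 retry defaults) is an assumption rather than anything taken from STEP.

```javascript
// Hypothetical defaults, chosen for illustration only.
const MONITOR_DEFAULTS = {
  active: true,              // can be turned on and off per Endpoint
  checkEveryMinutes: 15,     // frequency of monitoring
  maxConsecutiveRetries: 3,  // quantity of consecutive restart attempts
  recipients: []             // who receives the notifications
};

// Build and validate the configuration for one Endpoint.
function createMonitorConfig(endpointId, overrides) {
  if (typeof endpointId !== "string" || endpointId.length === 0) {
    throw new Error("endpointId is required");
  }
  const config = Object.assign({ endpointId: endpointId }, MONITOR_DEFAULTS, overrides || {});
  if (!(config.checkEveryMinutes > 0)) {
    throw new Error("checkEveryMinutes must be a positive number");
  }
  if (!Number.isInteger(config.maxConsecutiveRetries) || config.maxConsecutiveRetries < 0) {
    throw new Error("maxConsecutiveRetries must be a non-negative integer");
  }
  if (!Array.isArray(config.recipients)) {
    throw new Error("recipients must be an array of email addresses");
  }
  config.recipients = config.recipients.slice(); // avoid sharing the default array
  return config;
}

// True when an active Monitor has waited at least its configured interval.
// Timestamps are milliseconds; pass null if the Endpoint was never checked.
function isCheckDue(config, lastCheckedAtMs, nowMs) {
  if (!config.active) {
    return false;
  }
  if (lastCheckedAtMs === null) {
    return true;
  }
  return nowMs - lastCheckedAtMs >= config.checkEveryMinutes * 60 * 1000;
}

// Example: monitor a (made-up) inbound Endpoint every 10 minutes.
const exampleMonitor = createMonitorConfig("IIEP-Example", {
  checkEveryMinutes: 10,
  recipients: ["step-team@example.com"]
});
```

Keeping the settings as data per Endpoint, rather than hard-coding them in the monitoring logic, is what lets each requirement be adjusted without touching any code.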

I’m sure the STEP Consultants among you will be able to put together a solution that satisfies these requirements. Given the emphasis on getting something working in a short period of time, I threw together a conceptual design using Entity Object types (Hierarchy nodes and the “Monitor” itself) and a handful of attributes. Each Endpoint had a corresponding Entity Object where it was possible to configure the necessary details (frequency of monitoring, max consecutive retries, notification recipients, Active indicator). Then, with the use of a few Search Collections, a Scheduled Bulk Update, a Gateway Endpoint and some JavaScript code, it was possible to get STEP to self-monitor the status of its Endpoints on configurable timescales (and while I was at it I also included Event Processors). If any of these processes had failed, there would be a number of retries at a configurable frequency until either the process remained Enabled or the max retry limit was reached and STEP deactivated the Monitor. In both cases an email was sent to the STEP Management shared mailbox, informing us either that there had been a failure and the process had been restarted, or that there had been a series of failures and the Endpoint was still disabled. Job done.
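
The retry behaviour described above can be sketched as a small decision function. This is a simplified illustration in plain JavaScript, not the Business Rule code from the actual solution: the function name `evaluateMonitor`, the status string `"enabled"`, the action names and the email wording are all assumptions, and the real STEP calls for reading an Endpoint's status, restarting it and sending mail are deliberately left out.

```javascript
// Hypothetical status string; the real status names in STEP may differ.
const ENABLED = "enabled";

// Decide what to do for one Monitor on one scheduled check.
// monitor: { endpointId, active, maxConsecutiveRetries, consecutiveRetries, recipients }
// Returns the action to take, the updated Monitor state and an optional email.
function evaluateMonitor(monitor, endpointStatus) {
  if (!monitor.active) {
    return { action: "skip", monitor: monitor, email: null };
  }

  if (endpointStatus === ENABLED) {
    if (monitor.consecutiveRetries === 0) {
      return { action: "none", monitor: monitor, email: null };
    }
    // The Endpoint stayed enabled after a restart: reset the counter and report.
    return {
      action: "none",
      monitor: Object.assign({}, monitor, { consecutiveRetries: 0 }),
      email: {
        to: monitor.recipients,
        subject: monitor.endpointId + " failed and has been restarted"
      }
    };
  }

  if (monitor.consecutiveRetries >= monitor.maxConsecutiveRetries) {
    // Retry limit reached: deactivate the Monitor and hand over to a human.
    return {
      action: "deactivate-monitor",
      monitor: Object.assign({}, monitor, { active: false }),
      email: {
        to: monitor.recipients,
        subject: monitor.endpointId + " is still disabled after " +
          monitor.consecutiveRetries + " restart attempts"
      }
    };
  }

  // The Endpoint is down and retries remain: ask for a restart and count it.
  return {
    action: "restart",
    monitor: Object.assign({}, monitor, { consecutiveRetries: monitor.consecutiveRetries + 1 }),
    email: null
  };
}
```

Because the function only returns a decision and never touches the system itself, it can be checked in isolation, with the scheduler deciding how often it runs. One interpretation baked into this sketch is that the "restarted" email goes out on the check after a restart, once the Endpoint has been seen to stay Enabled.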

Within a few days of this going live, the Team Lead (who was the person monitoring the group mailbox for error notifications) expressed his gratitude for the solution I had delivered. It had been a bugbear of his. He had got himself into the habit of waking up early to check the mailbox in order to get a head-start on the Business if one of the Endpoints had failed out of hours. This solution ultimately meant that he was able to sleep easier, knowing that for the majority of Endpoint failures we experienced, STEP would effectively self-medicate and the data would, after a short delay, continue to flow across the IT landscape.

A few months later, as part of an upgrade of STEP, some additional configuration was introduced into the IIEP and OIEP functionality, allowing the user to determine whether automatic re-enablement of the IIEP/OIEP should be attempted. We saw this upgrade coming and I thought that it was going to mean end of life for my Monitor configuration. In the end, as a team we did some analysis and determined that the out-of-the-box functionality was not able to give us the flexibility that we were now used to. And so the Monitors live on!

Has anyone else had similar experience with these types of process failures? What impact did the failures have on the Business? What did you do to try to mitigate these issues?

PS

This article has been a bit of a divergence from the nature of some of the previous ones, mainly because I believe that it’s better that I share some of my direct experience and approaches to solutions in order to make the whole community a little more open and collaborative. On that note, it occurred to me that it would be great to have resources openly available containing things such as code snippets, self-contained features and perhaps even more detailed business justifications for specific solutions (all anonymised of course). So here’s a teaser: I’m going to explore this further and see what options I have available to me to support and collaborate with other STEP Users and Product Owners. If you’re interested in getting access to this type of content or have suggestions about what it should look like, then please get in touch. It would be good to have some external direction on this idea. Watch this space.