A couple of weeks ago Sulzer announced their new wireless condition monitoring system, Sulzer Sense. The service connects new hardware devices created by Treon to the existing SAP customer portal. Our team was responsible for designing and building the backend on Azure. Naturally, when dealing with hardware under development, there were moments when we had to be very agile. But Azure as a platform once again proved to be very flexible.
The rest of this story concentrates on the part I know best – Azure. But big praise also goes to our SAP CX (e-commerce, including the recently renamed Hybris) and mobile app teams, together with our analytics experts, for making Sense (pun intended) of the raw – or slightly processed – data. And thanks to all the people at Treon and Sulzer – this was a fun journey together!
TL;DR
Below is the architecture that the rest of this blog discusses. And last week at the SAP Finug event the team received the best possible praise from the customer: “If you are looking for partner who can handle SAP ERP, Hybris and Azure, Bilot is the natural choice.”
The solution is built on Azure PaaS and serverless components that can be easily scaled based on usage. The actual compute layer (Functions) scales automatically, and the journey of implementing the solution already proved the scalability. There is no maintenance overhead from virtual machines or similar infrastructure. Some scaling operations currently require manual steps, but they relate to service growth and are backed by easily monitored metrics and alerts. Naturally, this could be automated too if the system load were more random.
If you are interested in this kind of solution, we’d be glad to discuss with you here at Bilot.
Developing architecture
In the beginning there were multiple choices to be made. The devices do not stream data continuously; they send samples periodically or when an alert threshold is exceeded. Naturally we had a good idea of the number and size of the messages the devices would be sending, but no actual data at hand.
Our main principle was that the data processing should be as close to real time as possible – even though this was not a direct requirement, and scheduled frequent runs would have been good enough. But then we would have had to think about how to scale the processing power up when more devices and data need to be processed. So we decided to use serverless Azure Functions for the actual processing, let the platform scale up as needed, and have one less thing to worry about.
The devices communicate with Azure through a separate gateway node that is connected to the Azure IoT Hub service. In the first phase device provisioning was manual, but it is being automated in the second phase. Also, as only the gateways are directly connected to Azure, the number of gateways does not grow as fast as the number of devices behind them.
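The gateway implementation itself is Treon’s, but as a rough, hypothetical sketch of what sending one message fragment to IoT Hub could look like with the Python device SDK (the connection string, payload fields and the messageType property are placeholders, not the real schema):

```python
# pip install azure-iot-device
# Hypothetical sketch of a gateway forwarding one device message fragment to Azure IoT Hub.
import json
from azure.iot.device import IoTHubDeviceClient, Message

CONNECTION_STRING = "HostName=<hub>.azure-devices.net;DeviceId=<gateway>;SharedAccessKey=<key>"  # placeholder

def send_fragment(fragment: dict) -> None:
    client = IoTHubDeviceClient.create_from_connection_string(CONNECTION_STRING)
    try:
        msg = Message(json.dumps(fragment))
        msg.content_type = "application/json"
        msg.content_encoding = "utf-8"
        # A custom property lets Stream Analytics / message routing separate message types.
        msg.custom_properties["messageType"] = fragment.get("type", "unknown")
        client.send_message(msg)
    finally:
        client.shutdown()

if __name__ == "__main__":
    send_fragment({"type": "vibration", "deviceId": "sensor-001", "seq": 3, "payload": [0.12, 0.15]})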
Devices send different types of messages, so the next step was to separate these with the Stream Analytics service. We had decided to use Cosmos DB as temporary data storage, because it integrates well with Stream Analytics and Azure Functions and gives us good access for debugging the data. In the beginning we were unsure whether we would actually need Cosmos DB in the final solution, but it proved to be exactly what was needed…
Surprise!
When we started receiving data from the devices under development, the first thing we noticed (and had misunderstood from the specifications) was that the messages are actually fragmented – one whole message can consist of tens of message fragments. This is why I, an old grumpy integration guy, insist on getting real data from a real source as early as possible – to spot such misunderstandings ASAP. But due to the nature of this development project, that just wasn’t possible.
The Wirepas Mesh network the devices use has a very small packet size and the gateway just parses and passes all fragments forward. There is also no guarantee that the fragments are even received in order.
This was a bit of a surprise – but ultimately not a problem. As we had decided to stream all data to Cosmos DB for debugging, we already had a perfect platform to check and query the received fragments. We could also use out-of-the-box features like Time to Live (TTL) to automatically clean up the data and prevent id overruns. Some test devices sent data much more often than planned, which gave us good insight into system loads. We also spent a couple of hours manually optimizing Cosmos DB indexing, after which the request unit consumption (read: the price of Cosmos DB) dropped by about 80 %.
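As an illustration of the kind of tuning involved, something along these lines with the Python Cosmos SDK; the account, database, container and indexed paths are made-up examples, not our actual policy:

```python
# pip install azure-cosmos
# Hypothetical sketch: tightening the indexing policy and enabling TTL on the
# fragment container. Account, database, container and path names are placeholders.
from azure.cosmos import CosmosClient, PartitionKey

client = CosmosClient(url="https://<account>.documents.azure.com:443/", credential="<key>")
database = client.get_database_client("sense")

# Index only the paths the fragment queries actually filter on; exclude everything else.
indexing_policy = {
    "indexingMode": "consistent",
    "includedPaths": [
        {"path": "/deviceId/?"},
        {"path": "/messageId/?"},
        {"path": "/fragmentIndex/?"},
    ],
    "excludedPaths": [{"path": "/*"}],
}

database.replace_container(
    container="fragments",
    partition_key=PartitionKey(path="/deviceId"),
    indexing_policy=indexing_policy,
    default_ttl=7 * 24 * 3600,  # fragments expire automatically after one week
)
```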
Now that the data was streamed to Cosmos DB, we tied Azure Functions to trigger on each received message – which turned out to be every message fragment. What would be the cost implication of this? Only minor. First, we optimized the calculation so that, for those data streams where only complete data is useful, processing starts only when we detect that the last fragment of a data packet has been received. Most invocations stop right there; the actual calculation and storing of the data happens only for full, valid packets. When we measured the time and memory used by the operation, we saw that one million message fragments would cost about 0.70 € to process – the true power of serverless! So we are basically getting compute power for free, especially as Functions include quite a lot of free processing each month…
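A minimal sketch of the idea, assuming a Cosmos DB change feed trigger and invented field names (“fragmentIndex”, “fragmentCount”, “messageId” are not the real schema):

```python
# Hypothetical sketch of the fragment-triggered Function (Python programming model,
# with a Cosmos DB trigger binding defined in function.json).
import logging
import azure.functions as func

def main(documents: func.DocumentList) -> None:
    for doc in documents:
        # Cheap early exit: only the last fragment of a packet starts real processing.
        if doc.get("fragmentIndex") != doc.get("fragmentCount", 0) - 1:
            continue

        logging.info("Last fragment of %s received, assembling packet", doc.get("messageId"))
        # Here the real implementation would query Cosmos DB for all fragments of
        # doc["messageId"], verify the packet is complete and valid, run the
        # calculations, and store the result in long-term storage.
```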
Naturally we had to add scheduled retries for the rare scenarios where packet ordering has changed. For that we used Azure Storage Queues to store the events we need to recheck later. Retries are processed by scheduled Azure Functions that simply retrigger the messages that couldn’t be completed on the first try by updating them in Cosmos DB.
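A hedged sketch of that retry loop, assuming a timer trigger, a Storage Queue whose messages carry the Cosmos document id and partition key, and placeholder names throughout:

```python
# Hypothetical sketch: a timer-triggered Function drains a Storage Queue of packet
# references that weren't complete on the first try and "touches" the corresponding
# Cosmos DB documents so the change feed trigger fires again.
import os
import azure.functions as func
from azure.cosmos import CosmosClient
from azure.storage.queue import QueueClient

def main(timer: func.TimerRequest) -> None:
    queue = QueueClient.from_connection_string(os.environ["StorageConnection"], "retry-queue")
    container = (
        CosmosClient(os.environ["CosmosEndpoint"], os.environ["CosmosKey"])
        .get_database_client("sense")
        .get_container_client("fragments")
    )

    for message in queue.receive_messages(messages_per_page=32):
        doc_id, device_id = message.content.split("|")  # assumed message format
        doc = container.read_item(item=doc_id, partition_key=device_id)
        doc["retryCount"] = doc.get("retryCount", 0) + 1
        container.replace_item(item=doc["id"], body=doc)  # the update re-fires the trigger
        queue.delete_message(message)
```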
In principle, this work could also be moved to IoT Edge, as pre-processing there before sending to Azure would significantly lower the message counts on IoT Hub. In this case, the hardware development schedule didn’t make it possible to place IoT Edge on the gateway devices in the first phase.
Long-term storage
Long-term storage was another thing we thought about a lot. Cosmos DB is not the cheapest option for long-term storage, so we looked elsewhere. For scalar data, Azure SQL is a natural home. It allows us to simply update data rows even if scalar data fragments arrive late (without having to do the same fragmentation check that the stream data uses), and to make sure that rolling id values are converted to be unique for very long-term storage.
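As a simplified example of what such an update-tolerant write could look like as an Azure SQL upsert (table and column names are invented for illustration, not our actual schema):

```python
# pip install pyodbc
# Hypothetical sketch: upserting a scalar reading so late-arriving fragments simply
# update the existing row instead of requiring a separate fragmentation check.
import pyodbc

UPSERT_SQL = """
MERGE dbo.ScalarReadings AS target
USING (SELECT ? AS DeviceId, ? AS ReadingTime, ? AS Temperature) AS source
    ON target.DeviceId = source.DeviceId AND target.ReadingTime = source.ReadingTime
WHEN MATCHED THEN
    UPDATE SET Temperature = source.Temperature
WHEN NOT MATCHED THEN
    INSERT (DeviceId, ReadingTime, Temperature)
    VALUES (source.DeviceId, source.ReadingTime, source.Temperature);
"""

def upsert_reading(conn_str: str, device_id: str, reading_time, temperature: float) -> None:
    conn = pyodbc.connect(conn_str)
    try:
        conn.execute(UPSERT_SQL, device_id, reading_time, temperature)
        conn.commit()
    finally:
        conn.close()
```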
During development we naturally had to load test the system. We wrote a stored procedure to simulate and create a few years’ worth of this scalar data (a few hundred million rows). Again this proved to be a great reason to use PaaS services. Once we saw that our generator would run for a couple of days at our planned database scale, we could move to a far more powerful database tier, finish the test data creation in a few hours, and scale back again. So we saved two development days and finished the data creation in one night by investing about 30 euros.
Part of the data is just a sequence of numbers calculated by the devices themselves. These were not suitable for storing in a relational database. We could have used Cosmos DB, but as discussed, the price would probably have been a bit high. Instead, we decided to store this kind of data directly in Blob Storage. Then it was just a question of how to efficiently return the data for customer requests, and that was easily solved by creating a time-based hierarchy in the blob storage so the requested data is easy to find. So far it works perfectly even without Azure Search, which is kept as an option for later, if needed.
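To illustrate the idea, here is a sketch of a possible time-based layout and prefix lookup with the Python blob SDK; the container name and path format are examples, not the production naming scheme:

```python
# pip install azure-storage-blob
# Hypothetical sketch of a time-based blob hierarchy: write a calculated series under
# device/year/month/day and read it back with a simple prefix listing (no search index).
import json
from datetime import datetime
from azure.storage.blob import BlobServiceClient

def blob_path(device_id: str, ts: datetime) -> str:
    # e.g. "sensor-001/2019/11/05/103000.json"
    return f"{device_id}/{ts:%Y/%m/%d}/{ts:%H%M%S}.json"

service = BlobServiceClient.from_connection_string("<connection-string>")  # placeholder
container = service.get_container_client("calculated-series")

def store_series(device_id: str, ts: datetime, values: list) -> None:
    container.upload_blob(blob_path(device_id, ts), json.dumps(values), overwrite=True)

def load_day(device_id: str, day: datetime) -> list:
    # One day of one device's data is just a prefix listing.
    prefix = f"{device_id}/{day:%Y/%m/%d}/"
    return [
        json.loads(container.download_blob(b.name).readall())
        for b in container.list_blobs(name_starts_with=prefix)
    ]

# Example usage:
# store_series("sensor-001", datetime(2019, 11, 5, 10, 30), [0.12, 0.15, 0.11])
```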
Both long-term stores can easily be opened up for data analysts and machine learning use cases.
API for users
So far we have discussed only what happens to the incoming data. But how is it exposed to the end users? In this case we had the luxury of an existing customer portal, so all we needed to do was create backend APIs that serve the data in a secure way. This was accomplished with API Management in front of more Azure Functions. API Management is also used as a proxy when we need to query data from the portal’s data API, to give the serverless Functions platform a single outgoing IP address.
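As a rough sketch of one such API behind API Management – an HTTP-triggered Function with invented parameter names, not the real interface:

```python
# Hypothetical sketch of a query API published behind API Management: an HTTP-triggered
# Function returning one device's measurements for a given day. Parameter names and the
# (omitted) storage lookup are illustrative only.
import json
import azure.functions as func

def main(req: func.HttpRequest) -> func.HttpResponse:
    device_id = req.params.get("deviceId")
    day = req.params.get("day")  # e.g. "2019-11-05"
    if not device_id or not day:
        return func.HttpResponse("deviceId and day are required", status_code=400)

    # In the real solution this would query Azure SQL / Blob Storage and apply
    # unit and time zone conversions before returning the result.
    data = {"deviceId": device_id, "day": day, "values": []}
    return func.HttpResponse(json.dumps(data), mimetype="application/json")
```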
All the database queries, blob lookups, calculations, and unit and time zone conversions are executed on the serverless platform, so scaling is not an issue. Functions may occasionally suffer from cold starts, adding a small delay to those requests, but there is always the Premium plan to mitigate that – so far there has been no need for it.
We have also planned for the use of Azure Cache for Redis if we need to speed up data requests and see from usage patterns that some data could be prefetched in typical browsing sessions.
Conclusions
What did we learn? At least we once again confirmed that platform services are very well suited to this kind of work and free up effort for more important things than maintaining infrastructure. Testing and scaling the environment is just so easy, and even partially automated.
The importance of tuning Cosmos DB was also clear – that couple-of-hours tuning session was easily worth a few hundred euros per month in saved running costs.
And that the architecture is never perfect. New services appear constantly, and you have to choose your tools at one point in time, balancing between existing services and whether you dare to believe that some services will come out of preview in time. The key is to plan everything in a way that can be changed easily later, if – and when – a more efficient service becomes available.
(PS. And really, always prioritize getting real data out of the systems you are integrating as early as possible if you can)
