Imagine you’re a field service worker, constantly reacting to daily issues, almost like a police officer. Instead of criminals and creeps, though, your enemies are electricity and uninformed, angry customers. Not to mention being called after a service activity because the next shift is confused about what work is left. Good customer service and seamless restoration are our path to Nirvana.
Now bring this challenge to enterprise-size customers, who traditionally have quite complex information systems and solutions, technological or not. The friction created by sharing incorrect or misleading data is harmful, especially in the utility industry, where skilled field workers have the pleasure of informing customers of their estimated time to restoration.
If we want to keep skilled field workers busy with the duties that provide the most value, why not share restoration status through a customer app? The backend systems that predict restorations are complex, and their estimates need to carry the same confidence as a crew member’s. Automating customer service and field crew dispatch is a wonder of integration: issues are assigned based on each crew’s current location (proximity), skill sets, and past issues with specific devices.
Given that these backend systems have matured and have been in place for a long time, rearchitecting them for efficiency and scale must be done carefully. In some situations, rearchitecting is not realistic when the effort required is weighed against the value it generates. These systems and processes are behemoths.
On the other hand, mobile devices are convenient: crews walk into customer locations, update restoration status and times, and maintain excellent two-way communication between the back office and the outage location. There are also some unique situations where mobile devices become particularly handy. Personnel safety and security are vital when a crew walks into a site. Field jobs are dangerous, and crews may suddenly need immediate assistance when they run into emergencies. Sometimes a crew is suspicious of a location and of the emergencies that could arise after entering the premises. Maintaining effective real-time communication with backend systems is a must in these situations and in others where personal safety is a concern.
Architecting software systems for these emergencies, especially those where crews suspect a possible crisis, would traditionally mean maintaining infrastructure just for use cases that rarely happen. When they do happen, however, they must be handled with the highest priority. They fall into the low-probability, very-high-impact category. These threats pose an interesting quandary for architects: balancing cost against idle infrastructure. Luckily, the serverless architecture pattern came to the fore with the cloud. In our case, a set of AWS services came to the rescue, making the solution truly serverless and a perfect choice to strike that balance: a pay-per-use model with little or no operational overhead that still meets the desired communication needs. The building blocks were:
- Data Persistence – Amazon DynamoDB
- API – Amazon API Gateway
- Mobile Authorization and Authentication – Amazon Cognito (with AWS Amplify)
- Compute – AWS Lambda
- State Machine – AWS Step Functions
- Push Notifications – Amazon SNS
An intermediate data store in DynamoDB for tracking outages was an obvious choice: its flexible schema caters to evolving app features and changing business needs.
- Lambda is the compute layer that performs updates and retrievals against DynamoDB.
- API Gateway provides the RESTful interface for app integration.
- Cognito handles user authentication and authorization, federated with a corporate identity provider such as Active Directory.
- Amplify made the app developers’ lives easier by providing mobile SDK libraries for frameworks we love, like Google Flutter, to integrate with various AWS backend services, including Cognito authentication.
- DynamoDB Streams capture changes from the app in near real time and pass them to the legacy backend; changes from the backend flow back into DynamoDB the other way.
- SNS push notifications notify app users as changes arrive from the backend.
Throughout, the core guiding principle for this architecture was to stay serverless.
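To make the stream-to-backend hand-off more concrete, here is a minimal sketch of a Lambda function subscribed to the outages table’s DynamoDB stream. It assumes the stream view type includes new images (`NEW_IMAGE` or `NEW_AND_OLD_IMAGES`), converts each new image from DynamoDB’s typed JSON into plain values, and hands it to a publisher function. The Kafka wiring (`publish_to_backend`), the `outage_id` key, and the `updated_by` attribute used to avoid echoing backend-originated changes back to the backend are hypothetical names for illustration, not part of the original design.

```python
import json
from decimal import Decimal
from typing import Any, Callable, Dict


def from_dynamodb(attr: Dict[str, Any]) -> Any:
    """Convert one DynamoDB-typed attribute, e.g. {"S": "x"} or {"N": "1.5"}, to a Python value."""
    (type_tag, value), = attr.items()
    if type_tag in ("S", "BOOL"):
        return value
    if type_tag == "N":
        return Decimal(value)  # DynamoDB numbers arrive as strings; Decimal keeps full precision
    if type_tag == "NULL":
        return None
    if type_tag == "M":
        return {key: from_dynamodb(val) for key, val in value.items()}
    if type_tag == "L":
        return [from_dynamodb(val) for val in value]
    raise ValueError(f"unsupported DynamoDB type: {type_tag}")


def forward_changes(event: Dict[str, Any], publish: Callable[[str, str], None]) -> int:
    """Forward app-originated inserts and updates from a DynamoDB stream event; return the count."""
    forwarded = 0
    for record in event.get("Records", []):
        if record.get("eventName") not in ("INSERT", "MODIFY"):
            continue  # deletions are ignored in this sketch
        image = {key: from_dynamodb(val) for key, val in record["dynamodb"]["NewImage"].items()}
        if image.get("updated_by") == "backend":
            continue  # hypothetical marker: the change came from the legacy backend, so don't echo it
        # Keying by outage keeps all updates for one outage on the same Kafka partition, in order.
        publish(str(image["outage_id"]), json.dumps(image, default=str))
        forwarded += 1
    return forwarded


def publish_to_backend(key: str, value: str) -> None:
    """Hypothetical placeholder: connect this to a Kafka producer for the legacy backend."""
    raise NotImplementedError("wire a Kafka producer here")


def handler(event: Dict[str, Any], context: Any) -> Dict[str, int]:
    """Lambda entry point for the stream trigger."""
    return {"forwarded": forward_changes(event, publish_to_backend)}
```

Filtering on a marker attribute such as `updated_by` is one simple way to keep a bidirectional sync from looping; Lambda event source filtering on the stream is another option.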
The Alerts Challenge
Typically, two kinds of situations require a crew member to trigger an alert. The first is an immediate emergency: think 911 to all surrounding personnel, help needed now. Once a crew member pushes the emergency button in the app, the backend receives the alert in near real time through the DynamoDB stream and is updated via Kafka. Once the backend receives the update, a manual process ensures remedial action, typically a person following up with the crew offline and arranging an appropriate response. The alert status itself is updated on the outages table by a REST call from the app, handled by Lambda.
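A hedged sketch of the REST side: a Lambda handler behind API Gateway that records an alert on the outages table. It assumes a REST API with a Cognito user pool authorizer (so the caller’s identity arrives in `requestContext.authorizer.claims`), a hypothetical `Outages` table keyed by `outage_id`, and hypothetical status values and attribute names. The request-building step is kept as a pure function so it can be exercised without AWS credentials.

```python
import json
import time
from decimal import Decimal
from typing import Any, Dict, Optional

ALLOWED_STATUSES = {"EMERGENCY", "ACKNOWLEDGED", "RESOLVED"}  # hypothetical status values


def build_alert_update(event: Dict[str, Any], now: Optional[float] = None) -> Dict[str, Any]:
    """Turn an API Gateway proxy event into update_item arguments for the outages table."""
    outage_id = event["pathParameters"]["outageId"]
    caller = event["requestContext"]["authorizer"]["claims"]["sub"]  # Cognito user identity
    # parse_float=Decimal because the boto3 DynamoDB resource rejects Python floats.
    body = json.loads(event.get("body") or "{}", parse_float=Decimal)
    if not isinstance(body, dict):
        raise ValueError("request body must be a JSON object")
    status = body.get("status")
    if status not in ALLOWED_STATUSES:
        raise ValueError(f"unsupported status: {status!r}")
    return {
        "Key": {"outage_id": outage_id},
        "UpdateExpression": (
            "SET alert_status = :s, alert_lat = :lat, alert_lon = :lon, "
            "updated_by = :u, updated_at = :t"
        ),
        # Only update outages that already exist instead of silently creating new items.
        "ConditionExpression": "attribute_exists(outage_id)",
        "ExpressionAttributeValues": {
            ":s": status,
            ":lat": body.get("lat"),
            ":lon": body.get("lon"),
            ":u": caller,
            ":t": int(time.time() if now is None else now),
        },
    }


def handler(event: Dict[str, Any], context: Any, table: Any = None) -> Dict[str, Any]:
    """Lambda entry point; `table` can be injected for local testing."""
    try:
        params = build_alert_update(event)
    except (KeyError, ValueError) as err:
        return {"statusCode": 400, "body": json.dumps({"error": str(err)})}
    if table is None:
        import boto3  # imported lazily so the request logic can be tested without AWS
        table = boto3.resource("dynamodb").Table("Outages")  # hypothetical table name
    table.update_item(**params)
    status = params["ExpressionAttributeValues"][":s"]
    return {"statusCode": 200, "body": json.dumps({"alert_status": status})}
```

Writing `updated_by` with the caller’s identity also gives downstream consumers of the stream a way to tell app-originated changes from backend-originated ones.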
Other situations are a bit trickier. For example, a crew member may be uncertain of her personal safety until she walks into the site. If the site looks suspicious from the outside, before she even gets in, a mechanism lets the crew automate the alert: “If I don’t cancel this alarm in the next 10 minutes, send me help.” The alarm then triggers a signal in the backend to send help to the exact GPS coordinates.
This situation has two critical requirements. First, timers on mobile devices are unreliable because of battery capacity, signal strength, and so on. A false alarm is acceptable, but a real alarm must never be missed. Second, once the crew recognizes (as in most cases) that there are no safety issues at a location, the user should be able to disable the timer; however, only the crew member who set it, and no one else, should be able to disable it.
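The second requirement, that only the crew member who armed the alarm may cancel it, can be enforced on the server rather than trusted to the app. One way to sketch this is a conditional `update_item`, assuming a hypothetical `SafetyAlarms` table keyed by `alarm_id` with `armed_by` and `alarm_status` attributes, and a Cognito user pool authorizer on API Gateway: DynamoDB only applies the cancellation if the caller’s Cognito `sub` matches `armed_by` and the alarm is still armed. A fresh sign-in before cancelling is approximated by checking the token’s `auth_time` claim against a hypothetical five-minute window.

```python
import json
import time
from typing import Any, Dict, Optional

try:
    from botocore.exceptions import ClientError
except ImportError:  # stand-in so the sketch runs without the AWS SDK installed
    class ClientError(Exception):
        def __init__(self, error_response: Dict[str, Any], operation_name: str):
            super().__init__(operation_name)
            self.response = error_response

MAX_SIGN_IN_AGE_SECONDS = 300  # hypothetical: require a sign-in within the last 5 minutes


def build_cancel_request(event: Dict[str, Any], now: float) -> Dict[str, Any]:
    """Build update_item arguments that cancel an alarm only for the crew member who armed it."""
    claims = event["requestContext"]["authorizer"]["claims"]
    # Claims arrive as strings; auth_time is the epoch second of the user's last sign-in.
    if now - int(claims["auth_time"]) > MAX_SIGN_IN_AGE_SECONDS:
        raise PermissionError("sign in again to cancel this alarm")
    return {
        "Key": {"alarm_id": event["pathParameters"]["alarmId"]},
        "UpdateExpression": "SET alarm_status = :cancelled",
        # DynamoDB rejects the write unless the caller armed this alarm and it is still armed,
        # so the ownership check and the update happen atomically.
        "ConditionExpression": "armed_by = :caller AND alarm_status = :armed",
        "ExpressionAttributeValues": {
            ":cancelled": "CANCELLED",
            ":armed": "ARMED",
            ":caller": claims["sub"],
        },
    }


def handler(event: Dict[str, Any], context: Any, table: Any = None,
            now: Optional[float] = None) -> Dict[str, Any]:
    """Lambda entry point; `table` and `now` can be injected for local testing."""
    try:
        request = build_cancel_request(event, time.time() if now is None else now)
    except PermissionError as err:
        return {"statusCode": 401, "body": json.dumps({"error": str(err)})}
    if table is None:
        import boto3  # imported lazily so the request logic can be tested without AWS
        table = boto3.resource("dynamodb").Table("SafetyAlarms")  # hypothetical table name
    try:
        table.update_item(**request)
    except ClientError as err:
        if err.response.get("Error", {}).get("Code") == "ConditionalCheckFailedException":
            body = {"error": "alarm not armed by caller or no longer armed"}
            return {"statusCode": 403, "body": json.dumps(body)}
        raise
    return {"statusCode": 200, "body": json.dumps({"alarm_status": "CANCELLED"})}
```

Putting the ownership check in the `ConditionExpression` avoids a read-then-write race: a concurrent cancel by someone else, or the alarm firing in between, makes the write fail instead of silently succeeding.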
We found a solution for these requirements in Step Functions while remaining serverless. We could have kept the timer on the app, but that doesn’t satisfy the first requirement when the battery runs out or the app can’t make a REST call because it loses internet connectivity once inside. Running the countdown in a step function instead means it keeps going no matter what happens to the device. The second requirement is met by having the step function check the status flag before triggering the alert to the back-office systems, while the app validates the user by making them sign in again before they can cancel the alert. If the user loses signal or the battery runs out before cancelling, the step function sends a false alarm, which is acceptable given how rare these situations are in the first place.
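The “cancel within 10 minutes or send help” behavior maps naturally onto a Step Functions state machine: a `Wait` state, a read of the alarm’s status flag, and a `Choice` that either ends quietly or raises the alarm. The sketch below builds an Amazon States Language definition as a Python dictionary. It assumes a hypothetical `SafetyAlarms` table keyed by `alarm_id` with an `alarm_status` attribute, and a hypothetical `send-help-alert` Lambda function that forwards the execution input (including GPS coordinates) to the back office. It deliberately fails toward sending help: a missing status or a failed read both route to `SendHelp`.

```python
import json
from typing import Any, Dict

ALARMS_TABLE = "SafetyAlarms"           # hypothetical table name
SEND_HELP_FUNCTION = "send-help-alert"  # hypothetical Lambda that notifies the back office


def build_dead_man_timer(timeout_seconds: int = 600) -> Dict[str, Any]:
    """Amazon States Language definition: send help unless the alarm is cancelled in time."""
    status_path = "$.record.Item.alarm_status.S"  # GetItem returns DynamoDB-typed attributes
    return {
        "Comment": "Send help unless the crew member cancels the alarm before the timeout",
        "StartAt": "WaitForCancel",
        "States": {
            # The countdown runs in the cloud, so a dead battery or lost signal cannot stop it.
            "WaitForCancel": {"Type": "Wait", "Seconds": timeout_seconds, "Next": "ReadAlarmStatus"},
            "ReadAlarmStatus": {
                "Type": "Task",
                "Resource": "arn:aws:states:::dynamodb:getItem",
                "Parameters": {
                    "TableName": ALARMS_TABLE,
                    "Key": {"alarm_id": {"S.$": "$.alarm_id"}},
                    "ConsistentRead": True,  # see a cancellation written moments ago
                },
                "ResultPath": "$.record",  # keep the original input (GPS, crew id) alongside
                # Fail toward safety: if the read itself fails, send help anyway.
                "Catch": [{"ErrorEquals": ["States.ALL"], "ResultPath": "$.readError",
                           "Next": "SendHelp"}],
                "Next": "CancelledInTime",
            },
            "CancelledInTime": {
                "Type": "Choice",
                "Choices": [
                    # Rules are checked in order, so a missing status never reaches StringEquals.
                    {"Variable": status_path, "IsPresent": False, "Next": "SendHelp"},
                    {"Variable": status_path, "StringEquals": "CANCELLED", "Next": "AlarmCancelled"},
                ],
                "Default": "SendHelp",
            },
            "AlarmCancelled": {"Type": "Succeed"},
            "SendHelp": {
                "Type": "Task",
                "Resource": "arn:aws:states:::lambda:invoke",
                "Parameters": {"FunctionName": SEND_HELP_FUNCTION, "Payload.$": "$"},
                "Retry": [{"ErrorEquals": ["States.ALL"], "IntervalSeconds": 2,
                           "MaxAttempts": 6, "BackoffRate": 2.0}],
                "End": True,
            },
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_dead_man_timer(), indent=2))
```

The JSON string from `json.dumps(build_dead_man_timer())` could be passed as the `definition` argument of the boto3 Step Functions client’s `create_state_machine`, and each armed alarm would start one execution with `start_execution`. This needs a Standard workflow: Express workflows cap execution duration at five minutes, which is shorter than the 10-minute wait.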
In summary, serverless architecture has enabled us to protect lives by preventing catastrophes and bringing help as soon as possible. Building equally available and redundant infrastructure ourselves for a single, rarely used feature would have had prohibitive costs. And let’s not forget the ongoing licensing, maintenance, security, electricity, environmental impact, and personnel needed to maintain that infrastructure. For this kind of workload, serverless, pay-per-use services are superior to, and faster to deliver than, on-premises solutions.