The fintech business has had a massive impact on the African ecosystem in recent times, and this has contributed to the wider use of technology tools for transaction processing. The volume of financial transactions has surged over the years, and a lot of consumers have gradually shifted their transacting patterns from traditional in-branch banking to a more digital approach.
Whilst several entrepreneurs have been able to hop on the fintech transaction digitisation bandwagon, the ability to scale and service customers without major downtimes has been a huge concern over the years.
Here are 7 (being the number of perfection) tips and recommendations for achieving maximum uptime and reliability:
People
Companies tend to go with the 300 Spartans approach when starting off, and eventually go live and start scaling with the same number of people. Unfortunately, this is not a three-day journey, nor are you fighting 70,000 enemies that die only once. If a transaction fails today, it is retried almost immediately. Getting the right number of people with the right skill set and mentality cannot be overemphasised, as these individuals are the pillars of your organisation and the experts who operate the bazookas and armoured tanks that you use to service your customers.
Process
The Spartans had a systematic way of approaching their battles. They would have performed pre-battle rituals and gone into battle looking fierce and well trained. Many tech companies today lack the right process from the ideation stage to execution, and there are two parts to it:
Engineering: Take an engineering team in today's Gen-Z fintech startup world. We see the senior engineer playing the role of an Enterprise Architect, and the Product Manager playing the role of a Quality Assurance Engineer. This immediately weakens the build process, as short-sighted activities crowd out the overall goal of the product. What we then have is a product that has not gone through a development cycle rigorous enough to stand the test of time. There are several standard engineering issues that a product team should think about when building. These range from application scaling, database scaling, and API optimisations to default configurations, unoptimised database tables, and a host of others. With the right engineering process, many of these issues will be addressed even before go-live.
Issue Resolution: The question I typically ask companies or product teams when they go through issue resolution is, did we learn from that? Or did we just fix it and move on? I believe this is a pretty obvious aspect of process development. They say, “what doesn’t kill you makes you stronger,” but if you look back, did the resolution of that issue make you stronger? Or did it just create another grey area in your system that people will forget about over time? As an organisation, when issues come, you should find an immediate solution to keep servicing your customers, then create an incident report and implement a new process that will prevent that issue from occurring again in the future.
The process could be a new engineering task or product feature, or simply a step-by-step guide on dos and don’ts that people should follow.
Monitoring
This is one aspect of achieving 100% uptime that is largely overlooked. Companies simply don’t invest in monitoring. I mean, why pay someone to just look at a bunch of dashboards and screens for the whole day and not do anything? That’s simply “unrealistic” for some companies, and it only seems to make sense for big companies and organisations to have a monitoring team. However, what these companies fail to understand is that monitoring is one of the most important parts of a product. Now, some might argue that they have monitoring, but in reality, what they have are “problem announcers.” Monitoring is not about being reactive to issues. It’s about being proactive and being able to sense when a potential issue is about to occur. This can be achieved by a combination of tools tailored to your products and people who are well trained with the right skill set. A few tips on things to monitor include:
Database Monitoring: Just look out for slow and inefficient queries, DB table size growth, and resource utilisation.
API Monitoring: Look out for API calls that take the most time in a microservice and optimise them. There are several tools to achieve this, e.g., New Relic, Dotcom-Monitor, Checkly, Uptrends, etc.
Third-Party Systems Monitoring: This is also related to API monitoring, only that this time you might not be able to install custom tools for this. However, you can build out metrics and checks into your application, such as success rate and average response time tracking for the providers you are integrated with.
CPU & Memory Utilisation Monitoring: Sometimes, the underlying infrastructure can be the problem. Even after several optimisations of other components, you might still see the server struggling. Thankfully, in the modern world, extra gigabytes of capacity can be purchased from several sources to keep the application from struggling with resource utilisation.
Network Monitoring: Your VPN server can be the culprit, or even the firewall or router. They could be dropping packets or even have lost connection to the other entity’s server. Monitoring this by configuring alerts for idle tunnels, increased load, or packet drops can help the team become proactive.
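The success-rate and response-time tracking suggested above for third-party systems could look roughly like the sketch below. It is only an illustration: the class name `ProviderStats`, the helper `needs_attention`, the window size, and the thresholds are all assumptions, and a real setup would feed these numbers into whatever alerting tool you already use.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ProviderStats:
    """Rolling window of recent call outcomes for one provider (hypothetical helper)."""
    window: int = 100  # how many recent calls to keep; an assumed value
    calls: deque = field(default_factory=deque)

    def record(self, success: bool, latency_ms: float) -> None:
        # Store the outcome and drop the oldest entry once the window is full.
        self.calls.append((success, latency_ms))
        if len(self.calls) > self.window:
            self.calls.popleft()

    def success_rate(self) -> float:
        # With no data yet, treat the provider as healthy.
        if not self.calls:
            return 1.0
        return sum(1 for ok, _ in self.calls if ok) / len(self.calls)

    def avg_latency_ms(self) -> float:
        if not self.calls:
            return 0.0
        return sum(ms for _, ms in self.calls) / len(self.calls)


def needs_attention(stats: ProviderStats,
                    min_success_rate: float = 0.95,
                    max_avg_latency_ms: float = 2000.0) -> bool:
    """Flag a provider before it fails outright (thresholds are assumptions)."""
    return (stats.success_rate() < min_success_rate
            or stats.avg_latency_ms() > max_avg_latency_ms)
```

Calling `record` after every provider call and checking `needs_attention` on a schedule is one way to move from a “problem announcer” towards an early warning.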
Distributed Processing
There are three main parts of distributed processing:
Applications: There’s a reason the microservice architecture was invented. Please use it! Having one monolith doesn’t help your cause as an organisation. In addition, know when to scale your applications. Tools like Kubernetes can help with the auto-scaling of applications, and load balancers distribute the load across several instances of a microservice.
Databases: Know when to scale horizontally and vertically! Aside from backing up the database tables and adding more resources, sometimes it’s best to just horizontally scale the DB simply to separate concerns and reduce the load on one DB instance. It’s also similar to the microservice approach.
Alternative Routes: Don’t put all your eggs in one basket if you want to scale! The partners you’re connected to or relying on might not be willing or ready to grow as fast as you want to. Get several ACTIVE routes for a single responsibility. I repeat, ACTIVE, not passive. Use them actively and spread the load across them all. This will help you understand the strengths and weaknesses of your providers over time. Don’t just use one main partner and wait for them to fail before you move over to the next one. Use them all at the same time. This way you don’t create a bottleneck out of your providers.
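As a rough illustration of keeping several routes ACTIVE, the sketch below rotates transactions across all providers instead of holding some in reserve. The class `ActiveRouter`, the provider names, and the weights are hypothetical; it is a simple weighted rotation under those assumptions, not a production load balancer.

```python
import itertools


class ActiveRouter:
    """Spread traffic across every provider in proportion to its weight (hypothetical)."""

    def __init__(self, weights: dict):
        if not weights:
            raise ValueError("at least one provider is required")
        # Repeat each provider name 'weight' times, then cycle through the list forever.
        schedule = [name for name, weight in weights.items() for _ in range(weight)]
        self._cycle = itertools.cycle(schedule)
        self.sent = {name: 0 for name in weights}  # per-provider traffic counter

    def next_provider(self) -> str:
        name = next(self._cycle)
        self.sent[name] += 1
        return name


# Example usage with made-up provider names and weights.
router = ActiveRouter({"provider_a": 2, "provider_b": 1, "provider_c": 1})
picks = [router.next_provider() for _ in range(8)]
```

Because every provider carries live traffic, the per-provider counters (and any success or latency numbers you collect alongside them) build up a picture of each partner's strengths and weaknesses over time.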
Automatic Failovers
This is similar to the Alternative Routes point above. However, instead of a manual failover from one provider to another, you build monitoring tools into your application to detect when a provider is performing below its normal level and automatically switch to the next most optimal provider. This helps eliminate potential downtime caused by a provider, and it removes the time spent manually changing a configuration from one provider to another, thus presenting a much more reliable system to your customers. The same can be done for network-layer connections via VPNs. Have several VPN providers. Don’t just assume Google can never go down, or Azure can never go down; they are also companies like you.
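A minimal sketch of this kind of automatic switch is shown below. Everything in it is an assumption made for illustration: the names `ProviderHealth` and `FailoverRouter`, the 90% success threshold, and the minimum sample count. It uses simple lifetime counters, so a real system would more likely use a rolling window and periodically re-test a provider that was switched away from.

```python
from dataclasses import dataclass


@dataclass
class ProviderHealth:
    name: str
    successes: int = 0
    failures: int = 0

    def total(self) -> int:
        return self.successes + self.failures

    def success_rate(self) -> float:
        # A provider with no traffic yet is assumed healthy.
        return 1.0 if self.total() == 0 else self.successes / self.total()


class FailoverRouter:
    """Switch away from the current provider once it looks degraded (hypothetical)."""

    def __init__(self, names, min_success_rate=0.9, min_samples=5):
        self.providers = {name: ProviderHealth(name) for name in names}
        self.min_success_rate = min_success_rate  # assumed threshold
        self.min_samples = min_samples            # avoid reacting to one bad call
        self.current = names[0]

    def _is_degraded(self, health: ProviderHealth) -> bool:
        return (health.total() >= self.min_samples
                and health.success_rate() < self.min_success_rate)

    def record(self, name: str, success: bool) -> None:
        health = self.providers[name]
        if success:
            health.successes += 1
        else:
            health.failures += 1
        self._maybe_fail_over()

    def _maybe_fail_over(self) -> None:
        if not self._is_degraded(self.providers[self.current]):
            return
        # Pick the healthiest remaining provider; stay put if none qualifies.
        candidates = [h for n, h in self.providers.items()
                      if n != self.current and not self._is_degraded(h)]
        if candidates:
            self.current = max(candidates, key=lambda h: h.success_rate()).name
```

The point of the design is that the switch happens inside the application as soon as the numbers cross the threshold, with no one editing a configuration by hand.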
Penetration Testing
The importance of pen-testing your systems regularly cannot be overemphasised. Not just before go-live, but even after go-live. There have been several cases of hackers performing DDoS attacks on systems that eventually led to downtime. Carrying out white-hat testing helps identify weaknesses in your system’s security and ways to mitigate them.
Training
You cannot hire all the experts in this world, but you can train individuals to become experts! Training your employees is a crucial aspect of achieving reliability. The ability to learn new technologies, or even learn more about the technologies already in use in the organisation, helps employees deliver even more. Don’t just rely on the skill you hired the employee for, as it might be obsolete in another year or two. A few tips:
- Invest in online training courses for your employees, e.g., Pluralsight, Coursera, Udemy, and others.
- Carry out physical training for your employees by either sending them for training with organisations whose tools you use or inviting these experts to train your employees internally.
- Carry out routine cybersecurity training to help build reliable and secure systems.