Scaling up in field service management software

Every now and then, a client asks us whether we could just install our field service management software in their datacentre (or even their office). About five years ago we might have been able to say “Yes” to that, but the system is now so large that it is no longer practical to fit it all on one machine (if it is possible at all!).

The reason has to do with “scalability”. As more clients come on board, the number of people using the system at any given moment increases, which puts constantly growing pressure on the system. If we had just one server running everything, that server would be under a constant barrage of requests.

Let’s say we need to upgrade that server – for example, we think it needs more RAM to handle the amount of work it is doing at any one moment. To do this, we need to take the server offline, fit the new RAM, and bring it back online.

There are two obvious issues here. First, if you take the server offline, then everyone trying to use it is taken offline too, and when your clients’ field service operations depend on constant, timely updates, that absolutely must not happen. Second, there is a limit to how much RAM you can add to a server anyway. You can’t just keep adding forever!

This kind of scaling, where you increase the resources (RAM, storage, CPU) on a single server, is called “vertical scaling”, and it has limits. For example, depending on the hardware, the most RAM you can easily fit in a server might be around 64GB. After that, what do you do?

Instead, we use an architecture called “horizontal scaling”: when a server is running out of resources, you add another server alongside it.

In the beginning, as I said, we were on a single server while we were getting the fundamentals built and figuring out what we needed to do to stretch our wings. The first step into horizontal scaling is to figure out what you can separate out from the monolithic block of code. What makes sense to have on its own somewhere else?

Field worker software is the same as all other software – it has a logic part and a data part. It’s very easy to move a database off one server and onto another, and some databases even have at least some form of horizontal scaling built in.

MySQL, for example, which we use for some of our data, can be spread out in a “master/slave” topology: one or more databases act as “masters” (you write to them), and a group of others act as “slaves” (they copy data from the masters, and you read from them). This is a good step towards horizontal scaling, but not a perfect one, because some human intervention is still needed at times to set up the servers and tune them so they share the load fairly. With MySQL, scaling is a combination of vertical and horizontal: to add more storage, you need to add storage to every server in the cluster (unless you have set up “sharding”), and to let the cluster handle more connections, you add more slave servers.
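The routing idea behind this topology can be sketched in a few lines. This is a simplified illustration, not our production code: the server classes here are in-memory stand-ins for real MySQL connections, and replication is done synchronously for clarity (real MySQL replication is asynchronous, so reads can briefly return stale data).

```python
import itertools

class Server:
    """Stand-in for a database connection."""
    def __init__(self, name):
        self.name = name
        self.data = {}

    def execute(self, key, value):   # a "write" query
        self.data[key] = value

    def fetch(self, key):            # a "read" query
        return self.data.get(key)


class ReadWriteRouter:
    """Send writes to the master, spread reads across the slaves."""
    def __init__(self, master, slaves):
        self.master = master
        self.slaves = slaves
        self._next_slave = itertools.cycle(slaves)  # round-robin reads

    def write(self, key, value):
        self.master.execute(key, value)
        # Simulate replication copying the change to every slave.
        for slave in self.slaves:
            slave.execute(key, value)

    def read(self, key):
        # Each read goes to the next slave in turn, sharing the load.
        return next(self._next_slave).fetch(key)


master = Server("master")
slaves = [Server("slave-1"), Server("slave-2")]
router = ReadWriteRouter(master, slaves)

router.write("job:42", "assigned")
print(router.read("job:42"))  # assigned
```

Adding read capacity in this scheme is just a matter of adding another entry to the `slaves` list, which is exactly the horizontal part of MySQL scaling described above.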

CockroachDB (CRDB) is another database system we are looking at, which has horizontal scaling built right in from the beginning. With CRDB, if you find you need to scale, either storage-wise or speed-wise, you simply add another server to the cluster. It’s not compatible with MySQL, though, so you can’t simply drop one and use the other.
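To give a feel for what “simply add another server” looks like, here is the general shape of joining a third node to a small CRDB cluster. The hostnames, ports, and store path are illustrative placeholders, not our configuration:

```shell
# Illustrative only: start a new node and point it at two existing
# nodes in the cluster; CockroachDB rebalances data automatically.
cockroach start \
  --store=node3 \
  --listen-addr=localhost:26259 \
  --join=localhost:26257,localhost:26258
```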

We also use MongoDB with our field service engineer software, which is a “NoSQL” database. We use it for storing files which need to be accessible on multiple servers. MongoDB handles horizontal scaling very well, using an architecture called a “replicated sharded cluster”. Basically, it splits large files into smaller chunks and makes multiple copies of each (three is the recommended minimum), which it keeps in replica sets. The sharding and replication are completely automatic once set up, and if something goes wrong when reading any chunk, the system automatically figures out where it can get a working copy.

This works well. Recently, our hosting provider had an issue with power supplies in a datacentre and a load of servers went down, but because of how we spread our system across multiple datacentres (and countries), this had absolutely no effect on our clients at all. No-one noticed but the dev team, who were monitoring everything.
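The chunk-and-replicate idea is simple enough to show in miniature. This toy sketch is not how you would talk to a real cluster (for that you would use a driver such as pymongo); it just shows why losing a whole server doesn’t lose any data, using tiny chunks and in-memory “servers”:

```python
CHUNK_SIZE = 4   # toy value; real GridFS chunks default to 255 KB
REPLICAS = 3     # the recommended minimum number of copies

def split_into_chunks(data, size=CHUNK_SIZE):
    """Split a file into fixed-size chunks."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def store(data, servers):
    """Spread chunks across servers, keeping REPLICAS copies of each."""
    placement = []   # chunk index -> the servers holding that chunk
    for i, chunk in enumerate(split_into_chunks(data)):
        holders = [servers[(i + r) % len(servers)] for r in range(REPLICAS)]
        for s in holders:
            s[i] = chunk
        placement.append(holders)
    return placement

def read_back(placement):
    """Rebuild the file, falling back to a replica if a copy is lost."""
    out = b""
    for i, holders in enumerate(placement):
        for s in holders:
            if i in s:          # first server that still has this chunk
                out += s[i]
                break
    return out

servers = [dict() for _ in range(4)]   # four storage "servers"
placement = store(b"field service job data", servers)

# Simulate a failed server: it loses every chunk copy it held.
servers[0].clear()
print(read_back(placement))   # b'field service job data'
```

Because every chunk has three copies spread over four servers, wiping out one server leaves at least two working copies of each chunk, and the read path quietly routes around the failure.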

After the database, the application itself is examined and broken down conceptually into discrete units. The simplest example in a field service management system: there is a part of the system that the mobile devices talk to, there is the customer relationship management section, and there are the various report generators, emailers, and so on. Each of these can be carefully separated from the main monolithic block and put onto its own server.

To separate out the mobile-facing part of the application, for example, we made a copy of the main application, which we called the “mobile server”, and changed the mobile device code so it spoke to the mobile server rather than the main server. We made sure that everything sent to the mobile server (data, files) was available on the main server transparently and immediately, and then we simply removed all the code on the mobile server that wasn’t strictly about handling mobile app requests. When we repeated this on the main server (removing all the mobile app request handling code), we had two separate servers, each performing distinct services and each requiring fewer resources to stay fast and user-friendly.
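The end result of that split can be pictured as a front door that routes each request to the right backend. This is a minimal sketch under assumed names: the path prefixes and handler functions here are illustrative stubs, where in practice each handler would be a separate machine behind its own hostname.

```python
def mobile_handler(path):
    """Stub for the stripped-down mobile server."""
    return f"mobile server handled {path}"

def main_handler(path):
    """Stub for the main application server."""
    return f"main server handled {path}"

# Route by path prefix; first match wins, so the catch-all goes last.
ROUTES = [
    ("/mobile/", mobile_handler),
    ("/", main_handler),           # default: the main application
]

def dispatch(path):
    for prefix, handler in ROUTES:
        if path.startswith(prefix):
            return handler(path)

print(dispatch("/mobile/sync"))   # mobile server handled /mobile/sync
print(dispatch("/reports/run"))   # main server handled /reports/run
```

The nice property is that each backend only ever sees its own kind of traffic, which is what lets each server get by with fewer resources.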

There is the added bonus that when you break a monolith apart into its logical sectors, new developers don’t need to learn the entire system – they can just learn the part you put them on. In FieldMotion, our developers all have overlapping spheres of knowledge, so that if there is a question and I’m not around, then someone will know the answer.

Another thing to be careful of when you break a system down into its parts is to do it in such a way that the new standalone services can themselves be replicated. For example, we try to have at least three of everything, so if someone tries to access a server and it’s down or slow, a backup server is ready to take up the slack. It’s not just databases that need redundancy – you need a few of everything.
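The “three of everything” rule boils down to a simple client-side pattern: try each replica in turn and fall back to the next one if a call fails. Real systems usually put this behind a load balancer, but the retry-on-failure logic is the same idea, sketched here with stub replicas:

```python
class Replica:
    """Stand-in for one copy of a service."""
    def __init__(self, name, up=True):
        self.name = name
        self.up = up

    def handle(self, request):
        if not self.up:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} served {request}"

def call_with_failover(replicas, request):
    """Try each replica in order; only fail if every copy is down."""
    last_error = None
    for replica in replicas:
        try:
            return replica.handle(request)
        except ConnectionError as err:
            last_error = err      # this copy is down, try the next
    raise last_error              # every replica failed

replicas = [Replica("api-1", up=False), Replica("api-2"), Replica("api-3")]
print(call_with_failover(replicas, "GET /jobs"))  # api-2 served GET /jobs
```

With three replicas, one server being down or slow just means the next one quietly picks up the request, which is exactly why nobody outside the dev team noticed the datacentre outage described above.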