Stages of release in FieldMotion

I wanted to talk about why FieldMotion’s software process makes us one of the most robust and reliable field service management software companies on the market.

Startup companies race to complete features and ship them as soon as possible. We don’t do that, because we know that new features are never right first time. There is always a period of tweaking after a release to match what we built to how people are actually using it.

How we work is that the development team has a list of big projects, which they’ll work on for a few months. The projects tend to come to a natural close around six months into development, so we tidy them all up and announce them together as a new release.

The current release candidate will be our tenth official release since we started about five years ago, and will have three major products in it: a stronger Dynamic Scheduler (route optimisation for multiple vehicles, departments, etc.), strong Xero integration (both Public and Private API versions), and a new Purchase Orders system.

We’ve already planned out the bones of what’s coming for the six months after this, and we think it will be our biggest and most important release; not because of the new products in it (we’re thinking QuickBooks, and multiple languages, for a start), but because we’re dedicating most of our development resources to making sure the system is scalable (we want millions of people using it), fast even with all those users, never goes down for any reason, and is so simple to use that you’ll wonder why we bother making tutorial videos!

We’re already at the stage where most parts of the system have redundancy built in, so if a server or two (or even a full datacentre – hello Digital Ocean San Francisco, I’m looking at you…) goes offline for a while, we can reroute people to different areas that are still up and running. In most cases, there won’t even be a pause (when SF went offline, no-one noticed but our development team), but sometimes we will have a short period where parts of the system are offline.

In our own case, how it usually happens is that we’ll see an issue, we’ll solve it in a quick way that’s not perfect but gets the job done for now, and then we’ll solve it in a more permanent way that takes longer but is more robust.

Yesterday, for example, we had an issue where people logging into one particular server in our network were experiencing a large delay in their requests. We looked at this, couldn’t see the immediate cause, and figured the best thing to do at that moment was to increase memory and CPU (we multiplied the number of CPUs by 4, and the amount of RAM by 16 – hey, go big or go home!) to see if that fixed the immediate issue. It did, although it forced the people using that server (some customer relationship management software users, no mobile workers) to log in again – a price that had to be paid at the time to get the problem solved.

The more permanent solution in this case is that over the next six-month development period, we will be tearing that server type apart into separate “microservice” pieces that can be managed separately, and making sure that each of those new pieces can be turned on/off without affecting people that are logged in. Obviously, this will take time to plan and build, so the quick solution (throw more memory and CPU at it!) was the right answer in the meantime.

Yesterday’s issue took that particular server offline for three minutes (I was timing it) while we upgraded it. Three minutes is better than how long Google Drive was out last week, or how long the entire country of Japan was offline on August 25th (thanks again, Google), or how Melbourne’s train network was shut down for hours on July 13th.

We’re at that stage in a company’s life where we’re happy that we have the product we need, and that there is no need to keep adding more and more shiny stuff to it. That is why we’re happy now to spend the next six months or more just making sure it is the absolute strongest, fastest, most reliable, and best all-round cloud field service management software.

Iterative development

From our own point of view, and from the points of view of our customers, it is nice to be at a point in our service management software development where we are developing iteratively instead of coming out with new tricks every week.


image: clean, rinse, repeat

While I love it when we develop something new and exciting, it can be nerve-wracking for the weeks that follow, as we try to find out what these new shiny toys have done to our solid, stable, well-tested field service management engines.

New tricks never affect the existing workflow management system customers, because they are on servers that we do not edit except for the occasional bug-fix. The only customers that get to see the new shiny stuff are those that specifically asked for it, or those whose own requests meant we had to place them on our testing server while we worked on the next release.

FieldMotion is now so flexible that whenever we’re asked to develop anything new, it more than likely turns out that we can already do it, so my job is sometimes simply to listen to the request, and then point out how it can be done already with the workflow management software.

Sometimes, though, we are asked to do something that’s just slightly beyond what we can do at the moment. It’s never far away; just slightly.

For example, today I was asked if it was possible to do a certain thing with RFID tags. After thinking about it, I suggested a way we could do it that would involve adding maybe only 20-30 lines of code to the field service management app, and yet it opens up our possibilities to yet another broad channel of potential customers.

This kind of thing happens often enough that every six months, we have enough new little tricks that we can release a new version of FieldMotion’s field service engineer software, confident that there is enough newness in there to merit the release, and yet it is similar enough to the previous release that our current clients won’t be shocked at the difference.

Iterative development allows us to “tune” the workflow system to fit better with the customers, knowing for sure that the system already works very well for them, and we’re just adding enhancements, not adding whole new sections that need manuals.

I was explaining earlier today to a new developer that when we create a new widget or page for the field management software, we need to make sure it is as simple and obvious as possible. He was telling me how he liked the power that the CMS Joomla gives him when he creates a website. Yes, it gives you a lot of power, but at the expense of usability. Every time I have to work with a website that uses Joomla as its engine, I have to learn all over again how it works. That is bad user experience.

When you use any part of FieldMotion’s workflow software, it is straightforward and obvious how it works, whether you’ve been using it every day since you got your account, or this is your first time ever seeing it.

In my old life as a web developer, I would say to clients that “If you need a manual, I’ve built it wrong”. I stick to that slogan and make sure that everything we produce in our system is clean and obvious.

This is also why I love iterative development – instead of developing more and more and more stuff that piles up on the field service software like turrets and walls on a fairy-tale castle, we carefully expand the system just enough to fit the new trick in, and then just as carefully make sure that it is seamless and easy to understand.

How developers are developed in FieldMotion

No two projects are the same. A person moving from Facebook to Google will be lost for the first month. A person moving from Uber to Yahoo would also be lost. The same is true for all tech companies.

When we hire a developer for our workflow management software, we follow a “boot camp” process which gives the developer a grounding in how the workflow system works, and how the code is laid out.

New developers spend a month working in the Implementation department, where they develop digital forms for clients and get to understand how the field management software works from the user’s perspective.

The first month is crucial. New employees are encouraged to spot inefficiencies in how the paperless mobile forms work, and to come up with plans on how they would improve the system if they had the chance. Their plans are added to our internal issues tracker, and worked on over time, either by other developers, or by the new developer.

The next month is spent handling issues. This can be anything from fixing bugs, to reading logs to figure out what happened at specific times. This month teaches the new employee how the system is spread out, and how we arrange our code. Again, this is invaluable, as you cannot be a good developer if you don’t have a good understanding of the system.

Afterwards, the developer will “gravitate” towards one or more projects to specialise in. Some of our paperless office software solutions overlap (the field service management app and the mobile server, for example), but there’s plenty of room in the system for any developer to “own” a section that they can become expert in.

We encourage our developers to overlap in their knowledge, so that if Alice is out of the office today, Bob can take over if something comes up.

At FieldMotion, we design systems that help you plan your work, but as you can see, we do our best to plan our own internal workings just as rigorously. We intend to be the best workflow management software company in the world, and work constantly towards that goal.

image: we’ve found that pizza helps round them out

Upcoming voting feature for clients

We’re trialling a system in-house at the moment which allows us to throw our ideas into a database and then vote on which ones get priority. This way our field service management software development work is not dominated by one or two voices, but is led by consensus instead.

If this works out well, then we will be pushing it out to all of our workflow management software clients as well.

What will happen is that a client who has an idea for their job management software will explain it in a post, and then apply a few of their votes to that idea. If other clients like the idea, they will also vote for it.

We will monitor the voting carefully, and every week (that’s the plan, anyway!), we will take the top few and get started on them.

This way, you get a stronger voice in what we do, and help to guide the field service software towards something that is even better for your own business than it already is.

In-house, we’re using this to help prioritise the work that gets done: we (the field management system development team) have very clear targets each week, we don’t get mixed signals (everyone wants their own pet projects now, now, now, and we only have two hands each to work with 😉), and everyone feels a more collective ownership of things when they are completed.

We are really looking forward to seeing how this works out, and are excited to see what workflow software ideas you come up with yourself.

Why does FieldMotion software update so frequently?

One of our clients asked us why we release new versions of our workflow management software app and CRM so often.

image: the Waterfall Model is the general model of software development we follow to ensure that our clients experience only the most stable version of FieldMotion available (Maintenance in the model), unless they choose deliberately to use our testing (Verification in the model) server

We don’t really. Yes, there is always a lot of development going on with the field service management software, but this only filters down to the public once it’s been thoroughly tested. The only exception is when we fix an issue (today, for example, we fixed an issue where exporting job data from the job management software, while applying a custom filter based on the customer name, ignored the filter), but the system is really so well-used now that there are no common issues left. Even if we fix 50 issues, you will probably never notice any of them, because they all involve using the system in an uncommon way.

FieldMotion is cloud-based field service software – we have maybe five different versions of it serving all of our clients. When we release a new “stable version” of the field management system (every six months or so), we start moving clients onto it from the older versions. Because we have many more field management software clients than internal developers, clients will sometimes do something we did not expect, and we then have to fix whatever allowed that to happen.

The stable versions are called “stable” because they are changed as little as possible. In the Waterfall Model of software development, these versions are called Maintenance versions because the only ongoing development they receive from the moment of release is maintenance updates. The only reason we change anything on the stable field service manager software servers is to correct a bug. If a client insists that they need a new feature that we have not yet released on a stable field services management software server, then we move them to a testing server, because we do not develop new software on a stable server. Of course, we only move them after first making sure that they are aware that the testing server is, by its very nature, not a stable server, and therefore they might experience glitches every now and then. It is their own choice to make: wait for the requested feature to be released within six months on a stable field service management systems server, or jump the gun and move onto an unstable server that gets new features and tweaks almost every day.

With the app, we have “stable” points as well. Whenever we do anything new on the app, it’s added to a completely new repository version. Every repository version we have has a specific purpose for its existence. For example, repo 73 was created to help speed up a form that a client’s field workers pointed out was slow. We spotted the issue, fixed it, and their mobile workers’ forms now load exactly 54 times quicker (yes, exactly). Everyone that upgrades to a new repository version gets the enhancements that that repository and all the preceding ones bring. This means that if you are on version 62 (optionally disable job ref editing on the app) and we upgrade you to 73 (speed up form-based calculations), then you also get the enhancements and fixes for everything in between.

We are always adding new fixes, features, and optimisations to our service software code, but we only ever upgrade people if it’s necessary (such as to fix a bug which we identify as possibly affecting multiple people), or after we take a break at a certain repository version and decide to “rebase” everyone to it so we can have everyone on generally the same number again. Of course, we first put the app through yet another round of rigorous testing, but because later versions are by their nature more tested than earlier ones, we rarely, if ever (I really can’t think of a single case), come across an issue where we’ve broken something that previously worked.

To be honest, we probably update our stable servers much less than larger companies such as Microsoft do. I’m sure you are all familiar with Microsoft’s Windows telling you to please wait while it installs updates? Well, all software needs updates sometimes, but we try to make them in the background so you will never notice them.

So, to the client who thinks we release new versions all the time: no, we don’t. Yes, there are always new features being developed, and issues being addressed, but the only reason you would encounter all of those changes would be if you were a member of our development team, or one of the few who are early-access testers for us.

Scaling up in field service management software

Every now and then, we have a client ask us whether it’s possible that we could just install our field service management software in their datacentre (or even their office). At one time, about five years ago, we might have been able to say “Yes” to that, but the system is now so large that it’s no longer practical to stick it all on one machine (if it’s even possible!).

The reason has to do with “scalability”. As more clients are taken on board, the number of people using the system at any given moment increases. It’s a constantly increasing pressure on the system. If we have just one server running everything, then that server is under a constant barrage of requests from people.

Let’s say we need to upgrade that server. Let’s say, for example, that we think it needs more RAM to handle the amount of things it is doing at any one moment. In order to do this, we need to take the server offline, put the new RAM in, and put it back online.

There are two obvious issues here. First, if you take the server offline, then everyone who is trying to use the server is also taken offline, and if your field service management company needs constant and timely updates, then this must absolutely not happen. Second, there is a limit to how much RAM you can add to a server anyway. You can’t just keep adding forever!

This kind of scaling, where you increase the resources (RAM, storage, CPU) on a single server, is called “vertical scaling”, and it has limits: the maximum RAM you can easily put in a server (depending on the server) is about 64GB. After that, what do you do?

Instead, we use an architecture called “horizontal scaling”. With this, if you have a server that is running out of resources, then you add another server alongside it.

In the beginning, as I said, we were on a single server while we were getting the fundamentals built and figuring out what needed to be done to stretch our wings. The first decision in stepping into horizontal scaling is to figure out what you can separate out from the monolithic block of code. What makes sense to have on its own somewhere else?

Field worker software is the same as all other software – it has a logic part and a data part. It’s very easy to move a database off one server and onto another, and some databases even have at least some form of horizontal scaling built in.

MySQL, for example, which we use for some of our data, can be spread out in a “master/slave” topology, where one or more databases act as the “master” databases (you write to them), and a group of others act as “slave” databases (they copy data from the masters, and you read from them). This is a good step towards horizontal scaling, but not perfect, because there is still some human intervention needed at times to set up the servers and tune them so they share fairly. With MySQL, scaling is a combination of vertical and horizontal: to add more storage, you need to add storage to each server in the cluster (unless you have set up “sharding”), and to let the cluster handle more connections, you add more slave servers.
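As a rough illustration of what read/write splitting looks like from the application side, here is a minimal sketch using the Node.js mysql2 library. The hostnames, credentials, and jobs table are all made up for the example – this is the shape of the technique, not our actual code:

```typescript
import mysql from "mysql2/promise";

// Hypothetical topology: one master takes the writes, two slaves serve reads.
const master = mysql.createPool({
  host: "db-master.example.com", user: "app", password: "secret", database: "fieldmotion",
});
const slaves = ["db-slave-1.example.com", "db-slave-2.example.com"].map((host) =>
  mysql.createPool({ host, user: "app", password: "secret", database: "fieldmotion" })
);

// Writes always go to the master.
async function setJobStatus(jobId: number, status: string): Promise<void> {
  await master.execute("UPDATE jobs SET status = ? WHERE id = ?", [status, jobId]);
}

// Reads are spread across the slaves, which replicate from the master.
async function getJob(jobId: number) {
  const slave = slaves[Math.floor(Math.random() * slaves.length)];
  const [rows] = await slave.execute("SELECT * FROM jobs WHERE id = ?", [jobId]);
  return rows;
}
```

Note the caveat hiding in there: slaves copy data from the master with a small delay, so a read issued immediately after a write may not see it yet. That is part of the tuning mentioned above.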

CockroachDB (CRDB) is another database system that we are looking at which has horizontal scaling built right into it from the beginning. With CRDB, if you find you need to scale, either storage-wise or speed-wise, you simply add another server to the cluster. It’s not compatible with MySQL, though, so you can’t simply drop one and use the other.

We also use MongoDB with our field service engineer software, which is a “NoSQL” database. We use that for storing files which need to be accessible on multiple servers. MongoDB handles horizontal scaling very well, using an architecture called a “replicated sharded cluster”. Basically, it splits large files apart into smaller chunks, and makes multiple copies (three is the recommended minimum) which it keeps in replica sets. The sharding and replicating is completely automatic once set up, and if something goes wrong reading any chunk, then the system automatically figures out where it can get a working copy. This works well. Recently, our hosting provider had an issue with power supplies in a datacentre and a load of servers went down, but because of how we spread our system across multiple datacentres (and countries), this had absolutely no effect on our clients at all. No-one noticed but the dev team who were monitoring everything.
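For a flavour of what that chunked file storage looks like from the application side, here is a minimal sketch using the Node.js MongoDB driver’s GridFS API (the connection string, database name, and filenames are illustrative, and this is a sketch of the general approach rather than our production setup):

```typescript
import { MongoClient, GridFSBucket } from "mongodb";
import { createReadStream, createWriteStream } from "fs";

async function main(): Promise<void> {
  // Connect to the cluster; the driver handles finding a healthy member.
  const client = await MongoClient.connect(
    "mongodb://db1.example.com,db2.example.com,db3.example.com/?replicaSet=rs0"
  );
  const bucket = new GridFSBucket(client.db("files"));

  // Upload: GridFS splits the file into chunks, which the cluster replicates.
  await new Promise<void>((resolve, reject) => {
    createReadStream("report.pdf")
      .pipe(bucket.openUploadStream("report.pdf"))
      .on("finish", () => resolve())
      .on("error", reject);
  });

  // Download from any app server that can reach the cluster.
  await new Promise<void>((resolve, reject) => {
    bucket.openDownloadStreamByName("report.pdf")
      .pipe(createWriteStream("copy-of-report.pdf"))
      .on("finish", () => resolve())
      .on("error", reject);
  });

  await client.close();
}

main().catch(console.error);
```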

After the database, the application itself is examined, and broken down conceptually into discrete units. The simplest example in a field service management system is that there is a part of the system which the mobile devices talk to, there is the customer relationship manager section, and there are the various report generators, emailers, etc. Each of these can be carefully separated from the main monolithic block and put onto its own server.

To separate out the mobile-facing part of the application, for example, we made a copy of the main application that we then called the “mobile server”, changed the mobile device code so it spoke to the mobile server and not the main server, made sure that everything sent to the mobile server (data, files) is available on the main server transparently and immediately, and then simply removed all the code on the mobile server that wasn’t strictly about handling mobile app requests. Once we repeated this on the main server (removing all the mobile app request handling code), we had two separate servers, each performing distinct services, and each requiring fewer resources to stay fast and user-friendly.

There is the added bonus that when you break a monolith apart into its logical sectors, new developers don’t need to learn the entire system – they can just learn the part you put them on. In FieldMotion, our developers all have overlapping spheres of knowledge, so that if there is a question and I’m not around, then /someone/ will know the answer.

Another thing to be careful of is to make sure that when you break a system down to its parts, you are doing it in such a way that the new standalone services can be replicated. For example, we try to have at least three of everything, so if someone tries to access a server and it’s down or slow, then a backup server is ready to take up the slack. It’s not just databases that need redundancy. You need a few of everything.

Creating a Microservice

When developing a large application, it is sometimes necessary to split off well-defined subtasks so that they can be run on their own servers.

One example already developed in FieldMotion is our PDF microservice. As we started to grow, it became obvious that PDF generation was a bottleneck – it takes a while to generate a PDF, and while that is happening, it slows down everything else on that server. The solution was to move PDF generation off to a separate server dedicated to that purpose.

In that case, we used an asynchronous queueing system. We will use a simpler synchronous system for this article.

The problem I’m currently solving is… well, it doesn’t really matter what problem we’re solving. What matters is the details of how to handle authentication, receive the request, and send a reply.

How does a microservice work?

In our synchronous solution, the short answer is that we accept a request, and return a response.

The request must have some form of authentication and include details of the action to be performed.

The response should include a status such as whether or not the action was successful, and whatever errors were encountered along the way.

The authentication method we will use is called HMAC, which is defined (roughly) as hash(key || hash(key || message)), where “hash” is a hashing algorithm (md5, sha1, etc., though nowadays you’d use sha256 or better), “key” is a shared secret key, and “message” is the request, converted into a string. The double hash might seem a bit pointless, but the naive single-hash construction hash(key || message) is vulnerable to length-extension attacks with common hash functions, and the nested construction protects against that.
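In practice you don’t write the nested hashing yourself; most languages ship an HMAC implementation. A minimal sketch in TypeScript on Node.js (the payload and secret are illustrative):

```typescript
import { createHmac } from "crypto";

// Sign a request body with a shared secret. Node's built-in HMAC
// implements the full nested construction for us.
function sign(message: string, secretKey: string): string {
  return createHmac("sha256", secretKey).update(message).digest("hex");
}

const message = JSON.stringify({ action: "generatePdf", jobId: 12345 }); // illustrative request
const hash = sign(message, "our-shared-secret");
console.log(hash); // send message + hash to the microservice
```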

Don’t allow the same request to be run more than once. If a request is repeated, it is possibly someone replaying a captured message. To stop repeated requests, the microservice server simply keeps a note of all hashes that it has seen, and checks them before running the message’s request. So if a server sends a request, and then sends the exact same request again, the microservice checks its notes, sees that it has already recorded that hash, and can safely ignore it.

With a busy server, this list of hashes can get large quickly, so it’s good to clear out any old hashes every now and then to keep the list small.

But if you clear out old hashes, how do you stop someone from repeating the same request five minutes after it was sent? Simply add the current time to the message before hashing. On the microservice, if the time in the message is more than 60 seconds ago (for example), then ignore it, as it’s probably a repeat. And if it’s not a repeat? If it takes 60 seconds to send a request between two servers in the same datacentre, you have other problems…

There is one further question that needs to be answered. Let’s say that someone manages to hack a requesting server and gets a copy of the secret key. What do we do? Let’s say we have 50 other servers that use that same microservice. We can’t simply change the key on the microservice server, as that would cause all 50 of those servers to suddenly stop working!

The answer here is to have a list of secret keys on the microservice server. This way you can simply disable the one that’s been compromised, and the rest will still work. Just give a different shared key to each server, and a small identifying key name so the microservice knows which key to check against.

So, the sequence we now have is:

On Requesting Server

  1. Generate a message that describes the request.
  2. Add current time in seconds to the message.
  3. Generate a hash by running HMAC with the message and the secret key.
  4. Send message, hash, and key name to microservice.

On Microservice Server

  1. If time in message is more than 60 seconds ago, fail.
  2. Check local notes of recently run hashes. If the received hash is in the list, fail.
  3. If the secret key identified by the received key name doesn’t exist, fail.
  4. Generate a hash by running HMAC with the message you received and the secret key identified in step 3.
  5. If the generated hash is not exactly the same as the received hash, fail.
  6. Note the hash so it is not run again.
  7. Decode the message and run the embedded request.
  8. Return the result of the action to the requesting server.
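To make the two lists concrete, here is a minimal sketch of both sides in TypeScript (Node.js). The key table, the message shape, and the run() dispatcher are all illustrative assumptions – this shows the shape of the technique, not our actual microservice code:

```typescript
import { createHmac, timingSafeEqual } from "crypto";

// Hypothetical shared-key table on the microservice: key name -> secret.
// Revoking a compromised server is just a matter of deleting its entry.
const secretKeys: Record<string, string> = {
  "server-42": "a-long-random-secret",
};

const REPLAY_WINDOW_SECONDS = 60;
const seenHashes = new Map<string, number>(); // hash -> unix seconds first seen

function hmac(message: string, secret: string): string {
  return createHmac("sha256", secret).update(message).digest("hex");
}

// --- On the requesting server ---
function buildRequest(action: object, keyName: string, secret: string) {
  // Steps 1-3: describe the request, add the current time, sign it.
  const message = JSON.stringify({ ...action, time: Math.floor(Date.now() / 1000) });
  return { message, hash: hmac(message, secret), keyName };
}

// --- On the microservice server ---
function handleRequest(message: string, hash: string, keyName: string) {
  const now = Math.floor(Date.now() / 1000);
  const { time, ...request } = JSON.parse(message);

  // Step 1: reject stale messages.
  if (now - time > REPLAY_WINDOW_SECONDS) return { ok: false, error: "stale request" };

  // Step 2: reject hashes we have already seen.
  if (seenHashes.has(hash)) return { ok: false, error: "replay" };

  // Step 3: look up the secret by key name, so individual keys can be revoked.
  const secret = secretKeys[keyName];
  if (!secret) return { ok: false, error: "unknown key" };

  // Steps 4-5: recompute the HMAC and compare it to the received hash.
  const expected = hmac(message, secret);
  if (expected.length !== hash.length ||
      !timingSafeEqual(Buffer.from(expected), Buffer.from(hash))) {
    return { ok: false, error: "bad signature" };
  }

  // Step 6: note the hash, pruning entries older than the replay window.
  seenHashes.set(hash, now);
  for (const [h, t] of seenHashes) {
    if (now - t > REPLAY_WINDOW_SECONDS) seenHashes.delete(h);
  }

  // Steps 7-8: run the embedded request and return the result.
  return { ok: true, result: run(request) };
}

function run(request: { action?: string }) {
  // A real service would dispatch on request.action here.
  return `ran ${request.action}`;
}
```

One detail worth copying: the hashes are compared with timingSafeEqual rather than ===, because a plain string comparison can leak timing information to an attacker.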

Creating a microservice server using the queuing method that I mentioned for the PDF solution is more complicated. I’ll write about that kind of server the next time we need to build one.

Dynamically Updating App Scripts

As any person who develops apps will tell you, the most painful part is when the app leaves your laptop and is uploaded to the various app stores. In the case of Android, the Google Play Store takes an hour or two for the app to propagate, but in Apple’s case, the average wait is a whole week.

This can be incredibly annoying; especially when your upload is an emergency fix for something which could potentially cause problems. We’ve experienced it, and no amount of cajoling, praying, or bribery will make it go any faster.

To alleviate this problem, we’ve come up with a new method that will leave the propagation in our own hands, meaning that there is no waiting at all – when we release a new update, it is downloaded to our app within seconds.

Not only that, but the downloaded code is actually more secure than if it were part of an app bundle.

Also, we can target /who/ gets the updates. We can target, for example, a specific company, or a development team, or a specific person or device.

Ok, so how? I’m not going to share the actual code, but will describe the method we use, so any development team can replicate what we’ve done.

Well, we start out with a tiny “core” app, which is distributed via the app stores. This core rarely needs to be updated, so it doesn’t matter if it takes a week or a month for it to get through.

Upon first start-up, the core downloads a “login” script from our mobile server. This script simply shows the login screen and lets a person log in to their FieldMotion account. After logging in, the script then downloads user-specific scripts, and checks constantly from that point forward for any new scripts.

The trick is in /how/ this is done.

Scripts are downloaded over HTTPS, so they are not vulnerable to man-in-the-middle attacks. The login script is generic and can be downloaded by anyone at all, so it doesn’t matter if it can be read, but for the main app scripts, one step in security is making it difficult for potential hackers to read your source code. Using HTTPS means that hackers cannot simply listen to your WiFi and copy down what is downloaded.

When the scripts are downloaded, they are stored in the app’s embedded browser using localStorage. Because Android and iOS compartmentalise apps, one app cannot read the localStorage of another, so this is one of the most secure ways of storing your scripts so they cannot be read even if someone has access to your phone.
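A minimal sketch of that download-and-cache step, in TypeScript running inside the app’s WebView (the endpoint and storage key scheme are illustrative assumptions):

```typescript
// Fetch a script over HTTPS, caching it in localStorage under a versioned key.
async function loadScript(name: string, version: string): Promise<string> {
  const cacheKey = `fm:script:${name}:${version}`;

  const cached = localStorage.getItem(cacheKey);
  if (cached !== null) return cached; // already downloaded on an earlier run

  const response = await fetch(`https://mobile.example.com/scripts/${name}?v=${version}`);
  if (!response.ok) throw new Error(`failed to fetch ${name}: ${response.status}`);

  const source = await response.text();
  localStorage.setItem(cacheKey, source); // sandboxed per-app by Android/iOS
  return source;
}

// On first start-up: fetch the generic login script and run it.
loadScript("login", "1").then((source) => new Function(source)());
```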

Once a person has logged in, you can then send them the scripts that are specific to them, depending on the groups they\’re in, etc.

To keep their scripts up-to-date, you can set up a long-poll which constantly checks for new scripts to download. In the case of people that are using “stable” code, you can change this to a relatively delayed (every 30 minutes, for example) short-poll because it is less resource-intensive on the server side, and stable code doesn\’t need to-the-second updates.
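Here is a sketch of the delayed short-poll variant for “stable” users, reusing loadScript() from the previous sketch (again, the endpoint and response shape are assumptions for illustration):

```typescript
const POLL_INTERVAL_MS = 30 * 60 * 1000; // every 30 minutes for "stable" users
const currentVersions: Record<string, string> = { login: "1" }; // script -> version we hold

async function checkForUpdates(): Promise<void> {
  // Ask the server which version of each of this user's scripts is current.
  const response = await fetch("https://mobile.example.com/script-versions");
  const latest: Record<string, string> = await response.json();

  for (const [name, version] of Object.entries(latest)) {
    if (currentVersions[name] !== version) {
      const source = await loadScript(name, version); // from the sketch above
      new Function(source)(); // swap the updated script in
      currentVersions[name] = version;
    }
  }
}

setInterval(() => checkForUpdates().catch(console.error), POLL_INTERVAL_MS);
```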

Another benefit to this method is that you no longer need to compile your code for each phone. It can take a few minutes to install new code on an Android phone or iPhone, which is very annoying if it’s just a one-line change you want to test. With this method, though, you could do your development online using a code editor such as CodeMirror, and have it pushed to your test phones as soon as you click Save. Happy with the code? Just push it to a release group for your users.

To summarise, the main benefits of developing your apps this way are:

  • Dynamic upgrades on /your/ schedule
  • Code is more difficult for hackers to analyse
  • You can decide who gets what scripts
  • No compilation needed anymore – just test and release

 
