REST API (3): Tug of War
Every engineering design effort is an exercise in juggling opposing forces. In part 2 we looked at carving the entirety of the state into resources. This time we’ll dig deeper into more of the things one must think about when designing good APIs based on transferring “current or intended states” of resources, along the following lines:
Support for varied and, perhaps, unknown clients and use cases
Transactions: need to accomplish multiple things together or none at all
Overall runtime performance, caching
Resilience to failure, graceful degradation
Authorization and security
Who’s the boss: where’s the business logic?
Evolution and maintaining compatibility
Tightness of coupling
Effort spent
Client variation
In part 2 we looked at how non-trivial it is to figure out what types of resources to expose, even for just a single, known client and a single concrete example. Whatever approach we decide to take will have profound effects not only on communication but also on what code needs to be written, how much of it, and where it lives. Differences in code sizes and complexity are huge and cannot be ignored. Now suppose that a different kind of client, with different needs, was considered instead. Do you think that you would arrive at exactly the same decision on how to carve those resources? I don’t think so. Some may be similar. For some resources, we may not need to include some details but may need or benefit from others that weren’t included in the original approach. I’ve observed the following approaches to address this, in no particular order:
Fuse resources and data so that all data that any client may need is always present. Clients are supposed to ignore the extra data they don’t care about. This yields larger state representations with potentially fewer of them needed to cover any particular use case.
Split resources into fine-grained ones, each either needed for just one client or reusable for both. This yields smaller state representations but requires more of them to be gathered for any particular use case.
Reimagine resources to best suit each client. Presumably, this yields the best outcome for the clients but usually requires more effort and a lot of care to maintain consistent system behaviour.
Dynamically adjust resources to clients’ requests. Additional control data is passed with each request to specify the client’s needs better, such as what to include, embed and/or exclude in current state retrieval and/or what is included in the intended state. This affects caches (shared and private, local) because transferred state representations differ not only based on what data was included or excluded but also on how those inclusions and exclusions were themselves communicated: caches will treat a “name, email” and an “email, name” inclusion specification in URLs as different unless they are specifically built to understand this (see the sketch below).
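As a tiny illustration of that last point, here is a minimal sketch of canonicalizing a field-selection parameter so that equivalent selections map to one URL and therefore one cache key. The `fields` parameter name and the paths are assumptions for illustration, not a standard.

```typescript
// Sketch only: "fields" is a made-up query parameter for illustration.
// Sorting and de-duplicating the selection makes "name,email" and
// "email,name" produce the same URL, so shared caches see a single entry.
function canonicalFieldsUrl(base: string, fields: string[]): string {
  const canonical = Array.from(new Set(fields)).sort().join(",");
  return `${base}?fields=${encodeURIComponent(canonical)}`;
}

// Both print "/customers/42?fields=email%2Cname"
console.log(canonicalFieldsUrl("/customers/42", ["name", "email"]));
console.log(canonicalFieldsUrl("/customers/42", ["email", "name"]));
```

Without some such normalization (or a cache that understands the parameter), every ordering variant becomes its own cache entry.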
Here come the questions for you to apply to each of the above approaches:
How much does the server have to change? How about the clients? How much work is it for developers? Are you going to build BFFs (backends-for-frontends)?
How much work does the server need to do? How much of it is wasted because the particular client won’t care about it? How much overhead is incurred in basic request handling, reauthentication, reauthorization, etc. before the core processing even begins?
How much communication overhead (data size, latencies/time) is introduced?
Is that approach helping or exacerbating the N+1 issue (needing to issue N additional requests to get the data missing from the main 1 initially retrieved; see the sketch after this list)? Is it worse than that – could the client need an additional “M” for each one of those “N”? Does every communication also include data not needed?
How does this affect the ability to accomplish all or nothing of multiple things (transactions)?
How tightly coupled do your clients and servers become to each other?
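To make the N+1 shape concrete, here is a minimal sketch, assuming a hypothetical orders API in which each order merely links to its customer:

```typescript
// Sketch of the N+1 pattern: 1 request for the collection, then N more
// requests, one per order, to fetch the customer each order links to.
async function loadOrdersWithCustomers(baseUrl: string) {
  const orders: { id: string; customerUrl: string }[] =
    await (await fetch(`${baseUrl}/orders`)).json();

  return Promise.all(
    orders.map(async (order) => ({
      order,
      customer: await (await fetch(order.customerUrl)).json(),
    }))
  );
}
```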
Bonus questions for the dynamic approach:
Are you aware of any standards for this or are you going to wing it yourself? Hint: I know of some but they aren’t REST: OData, GraphQL.
Is your approach or those standards RESTful or does it begin to look like RPC?
Is there existing code, a library, or a framework to help you do this efficiently? Would you have to do this consistently for each resource type? Why (not)?
You made your decision. You built it. It works beautifully and everyone is happy. So happy, in fact, that your API is a market success. Because of that another, different client appears, soon followed by ten more. Even more appear, each evolving in different ways that you not only don’t control but don’t have time to pay attention to fully, even if you were permitted to. Some are custom mobile apps, others are made by your enterprise customers in evolving integration with their differing systems.
What will you do now? Are you at risk of this at all? Does it matter? Does it help to start on the right foot or is it OK to keep rebuilding from scratch … or build “BFFs” for everyone? Do you think that “the right foot” can exist at all? Maybe there’s only an approach that’s flexible enough to accept all kinds of “feet”?
All or nothing: Transactions
Say you want to move funds from one bank account to another. That’s one action affecting (at least) two resources. I’m aware of the following approaches people take:
Request all of them together, thus transferring the states of all (affected) resources at once.
Request all of them individually but in an extra context that controls their joint application, thus needing to have the concept of that “context” and the ability to apply it when ready (sketched after this list).
Apply them individually, using client-driven “undo” to roll back in case of a failure.
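One possible shape of option (2) is sketched below, assuming a hypothetical /transactions context resource and a tx query parameter; none of these paths or payloads are prescribed by REST itself.

```typescript
// Sketch of option (2): stage individual state changes inside an explicit
// transaction context, then commit the context as a whole.
async function transferFunds(base: string, from: string, to: string, amount: number) {
  // Create the context resource that will group the changes.
  const tx = await (await fetch(`${base}/transactions`, { method: "POST" })).json();

  const patch = (accountId: string, delta: number) =>
    fetch(`${base}/accounts/${accountId}?tx=${tx.id}`, {
      method: "PATCH",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ balanceDelta: delta }),
    });

  // Stage both account updates within the context...
  await patch(from, -amount);
  await patch(to, +amount);

  // ...then apply them together, or not at all if this step fails.
  await fetch(`${base}/transactions/${tx.id}`, {
    method: "PATCH",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ status: "committed" }),
  });
}
```

Note how the staged requests are no longer self-contained, which is exactly the statelessness question raised below.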
While “transactions” are often seen as being needed only for actions that cause changes, they are sometimes also needed to gather multiple chunks of data consistent with each other. Given that, think about:
What effect would option (1) have on your design, such as carving the resources? Do you know which combinations are needed so that you can directly support those? How does that get impacted by client variation and/or how does that impact caches?
Is option (2) RESTful at all considering “each request contains all of the information necessary for a connector to understand the request, independent of any requests that may have preceded it” (Fielding 5.2.2)? How can caches “know” that a later step failed and they need to revert their view? Better yet, how can they isolate any state updates until they are committed?
Who is in control in option (3)? Can that even work in face of the communication failure? What about transaction isolation – preventing others from seeing the effects until it is complete? How about only undoing the changes made to the intended parts of the state, leaving concurrent changes to other parts, made by other clients, untouched? Or should that happen at all, if those clients depended on the interim state being rolled back? How can they be enabled to react to this and what would they need to do?
Steady vs Transitional States
Most often API designers try to imagine representational states as steady and only modified RESTfully by client requests. Say that your API is managing other computers and one thing it needs to be able to do is to restart them. Which option would you pick?
Have the client update the “state” property from “started” to “stopped”, wait a little, then update it again to “started”? What happens if there are additional interim states? What happens if a new interim state is introduced?
Introduce a server-managed transitional state “restarting”. The client would only update the state from “started” to that “restarting”, and the server would run its own workflow through all interim states accordingly. What needs to be considered for other state updates coming in while the server is still running that workflow?
Introduce a new “computer restart” resource and create it as linked to the computer to be restarted.
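A sketch of that last option, assuming a hypothetical “restarts” sub-collection under each computer (the names are illustrative only):

```typescript
// Sketch of option (3): express the intent by creating a linked "restart"
// resource; the server owns the workflow and exposes its progress as state.
async function requestRestart(base: string, computerId: string) {
  const response = await fetch(`${base}/computers/${computerId}/restarts`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ requestedAt: new Date().toISOString() }),
  });
  // The created resource can then be polled for interim states
  // ("stopping", "starting", ...) without the client driving them.
  return response.json();
}
```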
Performance & Caching
The performance of a design isn’t affected just by how long a server spends serving a single request. The list is closer to the following and must consider N+1 issues:
How long do all the servers spend serving all requests needed for the use case?
How much time will the data spend in transit, and how much network latency will be incurred?
How much can caches help with the above?
What will be the cache hit rate given state representation variation? Considering that, is it more helpful than it is cumbersome to enable and work with?
Is (shared) caching even permitted due to access control requirements?
Do freshness and consistency requirements allow caching at all?
What kind of work does the client need to do to aggregate, filter, and transform everything needed for the use case, perhaps across multiple servers?
How (sub)optimal is the code of both the server and the client, considering how much time developers have available to work on optimizing it vs. just dealing with complexity?
Achieving good performance implies doing well in each of those “categories”, yet working on one takes development time away from another and makes it more complex. Addressing N+1 issues requires either development-time or runtime (dynamic) specialization for each use case, reducing the hit rate on caches, if any. Increasing the cache hit rate requires standardization across individually cacheable resources, which brings the N+1 issue back. Hand-coding specialized resources multiplies the codebase in ways that are hard to make consistent, as the pieces aren’t quite duplicates but often do similar things.
Resilience
Failure is always an option. What do we do when it affects only a part of the whole thing? For example, we may be unable to retrieve a small part of the resource state. Should we fail the whole thing or respond with what we were able to get, apologizing for the rest? What’s the status code for that partial response? It isn’t 206, that’s for “Range” requests. How do we “apologize” for the missing data so that it doesn’t confuse anyone, such as caches and clients expecting conformance to media type structure (response format)? Would that apology be a state representation at all?
One way to address this is to divide resources into sufficiently fine-grained units. This way the client can ask for many of them and receive separate success indications for each, allowing for graceful degradation. Think, however, about what that does to performance and the N+1 issue.
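For example, a client could fetch the fine-grained parts independently and degrade per part, as in this sketch assuming hypothetical sub-resources of a customer:

```typescript
// Sketch: fetch fine-grained resources separately so one failure degrades
// the result instead of failing the whole use case (at the cost of extra
// round trips, the N+1 trade-off discussed earlier).
async function loadCustomerView(base: string, customerId: string) {
  const parts = ["profile", "orders", "recommendations"];
  const results = await Promise.allSettled(
    parts.map(async (part) =>
      (await fetch(`${base}/customers/${customerId}/${part}`)).json()
    )
  );
  // Each part carries its own success/failure; the client renders what it got.
  return Object.fromEntries(parts.map((part, i) => [part, results[i]]));
}
```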
Authorization
Similar to failures, lack of authorization can prevent any part of a response or a request from being handled regularly and requires considerations similar to those noted in the previous section on resilience. There are additional considerations and oddities here. While failures are typically not something to hide, lack of authorization often requires a part of the data, sometimes select values in a collection, to be omitted as if they weren’t there at all. This is “success” when it comes to access control. Does this seem to make the problem go away? Well, not so fast. Think of a roundtrip scenario:
Client gets a state representation with some values omitted from some multi-value property.
Client adds and/or removes some values it saw there, never realizing the existence of the omitted ones.
Client sends the updated representation of the entire resource to the server as the “intended” state.
What should the server do? The new intended state won’t include the hidden values. Should the server remove them and use the new intended state verbatim? Or should it leave them in? Is the new combination of values (intended + hidden) valid or not? If it is invalid and the server indicates that, can that not be used to “phish out” the hidden values? Also, should the server be able to realize what was hidden based on then-effective authorization data and use that in its processing? Would this not violate the requirement for requests to be complete and not depend on previous ones (Fielding 5.2.2)?
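One possible, by no means prescribed, server-side policy is to merge the client’s intended values with the ones that were hidden from it; the sketch below assumes a simple multi-value property, and the names and shapes are illustrative:

```typescript
// Sketch: apply the client's intended multi-value update only to the values
// it was allowed to see, keeping the hidden ones untouched.
function mergeIntendedValues(
  intended: string[],           // values the client sent back
  stored: string[],             // values currently stored by the server
  visibleToClient: Set<string>  // values this client was authorized to see
): string[] {
  const hidden = stored.filter((value) => !visibleToClient.has(value));
  // Whether the resulting combination is still valid, and whether saying so
  // leaks the hidden values, is exactly the question raised above.
  return [...hidden, ...intended];
}
```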
Recognizing malicious load
When all requests look alike, how does one recognize which one is malicious, perhaps part of a DDoS-style attack? A common way that crystallized “out there” is to use request rate limiting, perhaps separately for each type of resource (and potentially 0 for some client-resource pairs). This is built on the expectation that regular clients, driven by regular needs, such as serving human interactions or otherwise, have defined behaviours with their own physical limitations. The aim is to translate those limitations into request rate limits.
There is a disconnect, though. Client (and human user) behaviours have limitations that are defined in their own, higher-level use case domain. Think of a single human clerk’s live interaction with a human customer. Consider not only how many clicks but how many trips between the client and the servers could be needed to support that.
Unless each use case maps 1:1 to an API request, we will need to account for more requests than client use case runs. Worse, we will have to account for many different, yet realistic combinations of use cases over time and be reasonably ready for the worst possible case – the maximum rate they would cause. This gets especially bad if the N+1 issue exists, and usually it does. It multiplies any kind of human interaction by a significant factor needed to arrive at a reasonable rate limit that would not prevent regular operation.
That is where malicious clients come in. They can target the inflated rate limits with intentionally the most expensive requests. In a dumbed-down example, imagine a single, joint rate limit meant to allow many simple resource fetches by id, yet spent on requests with complex searches, data transformations and huge responses each time.
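One possible mitigation is to weight requests by estimated cost rather than counting them equally; the sketch below is illustrative only, and the cost values are made up:

```typescript
// Sketch: a cost-weighted budget, so a limit sized for cheap by-id fetches
// cannot be spent entirely on expensive searches.
const requestCost: Record<string, number> = {
  "GET /customers/{id}": 1,         // simple fetch by id
  "GET /customers?search=...": 25,  // complex search, transformation, big response
};

function consumeBudget(budget: { remaining: number }, kind: string): boolean {
  const cost = requestCost[kind] ?? 10; // unknown request shapes cost more
  if (budget.remaining < cost) {
    return false;                       // throttle or reject
  }
  budget.remaining -= cost;
  return true;
}
```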
How did you solve this problem in your REST API? Did you? Or do you have more important things to pay attention to?
Who’s the boss?
Servers serve representations of the current states of resources. Servers accept representations of intended states of resources. Beyond doing this and, potentially, complaining about (in)validity, if they can, what else do they do? Where is (the code for) that business logic beyond data gathering, validation and storage?
If it isn’t in the servers, then it can only be in the clients, right? Is that good? How do you ensure that all the clients have at least compatible business logic implementation, if not the same? How do you ensure that all the clients have up-to-date code to match, at any time, including if they are running at the time? Will that include polite requests to malicious clients too?
In any case, that brings us to…
Evolution vs Compatibility
What do we have to consider when evolving an API while having existing clients that we can’t change? It is commonly understood that we don’t want to remove what those clients depend on, such as properties/parts of state representations they use when getting the current state or sending the intended state. That isn’t the whole story, though – far from it.
We also don’t want to send additional data they don’t expect in current states or expect that additional data in the intended state. Often people suggest that clients should just ignore what they don’t expect but that doesn’t quite cut it for a number of reasons, e.g.:
Extra data can cause, and has caused, buffer overruns.
Completely ignored parts of the current state won’t be included in the intended state sent back in updates (sketched after this list).
There may exist relations between the extra and previously known data that may need to be considered together during updates.
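The second point deserves a concrete sketch: a legacy client that copies only the fields it knows and then sends the whole representation back silently drops a later-added field. The field names below are illustrative.

```typescript
// Sketch: the client "ignores" loyaltyTier (added after it was written) by
// copying only the properties it knows about, so its full-representation
// update sends an intended state without that field.
interface KnownCustomer { id: string; name: string; email: string; }

async function renameCustomer(url: string, newName: string) {
  const current = await (await fetch(url)).json();
  const known: KnownCustomer = { id: current.id, name: newName, email: current.email };
  await fetch(url, {
    method: "PUT",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(known), // loyaltyTier is gone from the intended state
  });
}
```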
To work around this issue it became common practice to version the APIs so that legacy clients continue to work as they did before, while new clients get to move on. Put your thinking hat back on now:
Do you always have to move all the resources (types) to new versions whenever any one of them changes?
Yes: how much work is that? Would you hand-code all that for optimization purposes or handle that dynamically? What’s the risk? How many versions of your API will you create? Will it be possible for clients to iteratively migrate their code from one API version to another?
No: consider the client following the indicated links to related resources. Should these links be versioned?
No: the client must be able to build the entire URL, including the relevant version, knowing that not all versions exist for all resources and may not match.
Yes: How do you know which versions the client supports? Suppose you do, and the client follows that link. It is now looking to follow the reverse link. How do you make sure it goes back where it came from and not to some “latest version available before the source version”? In other words, how do you ensure a compatible roundtrip? Bonus question – the client stores those ids in its own database. A few API version upgrades later it looks one up and tries to get the resource. By that time it not only can use a newer version but requires it. What does it do?
Let’s throw another wrench in there. Do links include full URLs with hostnames? That sounds friendly as the clients don’t have to build those and can store and reference them at a later time. It also leverages existing DNS and gateway infrastructure to seamlessly support distributed services. But does it?
Servers and services move, split, merge. New, closer ones may be added, older ones decommissioned. How do you plan on addressing that? Using tombstone proxies, gateways for each decommissioned host? Forcing comprehensive client updates? Do you have a better idea?
Overengineering? (Im)possible?
Am I leading you towards clear, blatant overengineering? It may appear so but that was not my goal. I’m not the one doing this – I’m merely a messenger pointing out how many things need to be considered and how profound the effects of a wrong turn may be. To design an API based on resource state transfers that will be good for a while requires all the considerations that (may) apply. The same is true for any other API approach, but the set of considerations will be different. Should you consider those alternatives or are you happy with your mastery of this particular tug of war?