For 18 months my colleague and I, in a small startup, attempted to use GraphQL to build a highly real-time, cloud-native virtual conferencing platform. We succeeded, but we paid a big price for our choice of API design and middleware. In this post, I’ll explain my opinions on GraphQL and associated technologies, and what choices I’m making at the new startup I’ve joined.
As with any API design today, there are quite a few issues I could pick up on. All but one of the worst are essentially solved problems in REST APIs, but GQL somehow makes them a whole lot worse.
Before I list the problems though, I want to add a few caveats. I’m writing this from the perspective of small teams - less than 50 engineers. If you’re a massive company with a big engineering team, then use GQL. Go for it. You have the capacity to engineer around the problems and will benefit from GQL’s better aspects.
I’m also writing from the perspective of a complex web app. What I mean by that will become clearer as we go along, but broadly speaking, if you have any of the following design requirements you might want to avoid GQL: real-time (<1 second latency) API calls, ABAC or generalised authz ACLs (Zanzibar style), or highly cacheable data.
So, what are my top issues?
Throughout this article, I am making a comparison to REST APIs, with tools/ORMs like Prisma in mind.
Everyone’s experience of an API design is coloured by the tools they use. I was using Hasura and Apollo, and later switched to Urql.
Problems (1) and (3) are generic - no tool or middleware can solve fundamental design issues in GQL. But (2), (4) and (5) are tool/middleware problems, arguably made worse by the properties of GQL APIs.
The power of GQL is the ability to query anywhere through your graph. Even better, your frontend engineers do not need to understand anything of the backend to be able to make entirely new demands of your API. Sounds awesome!
However, this creates a fundamental problem in GQL. The “surface area” of your API is a vague measure of how many endpoints and models your backend team needs to worry about. With REST, your surface area is controlled by your team - how many endpoints did you add to the API?
With a typical GQL ORM, every model automatically generates 4 CRUD endpoints, plus a linear number of additional interactions for each relation to the model. When you scale this up to 100+ database tables, plus custom (REST) endpoints and remote APIs, your surface area balloons rapidly. The result is unmanageable complexity.
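To make that concrete, here’s a rough sketch of the surface a Hasura-style middleware generates for just one table. The field names follow Hasura’s conventions but the exact set varies by tool, and the events/rooms schema is made up for illustration:

```typescript
// Illustrative only: the root fields a Hasura-style middleware exposes for a
// single `events` table with one relation to `rooms`.
import gql from "graphql-tag";

const generatedSurface = gql`
  type query_root {
    events(where: events_bool_exp, limit: Int, offset: Int): [events!]!
    events_by_pk(id: uuid!): events
    events_aggregate(where: events_bool_exp): events_aggregate!
  }

  type mutation_root {
    insert_events(objects: [events_insert_input!]!): events_mutation_response
    update_events(where: events_bool_exp!, _set: events_set_input): events_mutation_response
    delete_events(where: events_bool_exp!): events_mutation_response
  }

  type events {
    id: uuid!
    name: String!
    # ...plus a nested field per relation, each with its own filters and limits
    rooms(where: rooms_bool_exp, limit: Int): [rooms!]!
  }
`;
```

Multiply that by every table and every relation, and every one of those fields is a query your frontend can legitimately send you tomorrow.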
Your two-person or 20-person team is not going to be able to think through all the possible queries. This means you’re just coding-and-hoping when it comes to whether your API is secure, performant, robust and reliable. It might be fine today and next week; then you hire a new frontend dev and all hell breaks loose when they write a new query.
REST APIs have this one solved. Your team controls the surface area and can (and should) write tests for endpoints. Changing the API is a contained problem: you can test performance, and a frontend dev needs the backend team to add new API calls (which implies a robust process for testing each new query before it’s used).
As it stands, Google’s Zanzibar is the gold standard for authorization. A few teams are creating their own implementations, but it’s complex, time-consuming and arguably overkill for most apps.
RBAC and ABAC are well established choices for smaller apps. Most GQL ORMs/middlewares support RBAC and possibly ABAC.
Hasura is the same; in fact, its permissions system is more flexible than most. But if you need anything more than RBAC (big caveats coming!) then you’re going to end up using webhook-based authz. There are managed services that can help with that - Oso offers one, for example (but that’s not an endorsement - I’ve never used Oso). However, this exposes you to significant performance challenges.
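To be clear about what that costs: a webhook check is an extra network round-trip on the hot path of every request. The sketch below is a generic illustration of the pattern, not any particular product’s API; the route, payload shape and rules are all invented.

```typescript
// Hypothetical authz webhook: the GQL middleware calls this service before
// resolving a request, adding a network hop (and another service to scale)
// in front of every query.
import express from "express";

const app = express();
app.use(express.json());

app.post("/authz", (req, res) => {
  // Field names are illustrative, not a real product's contract.
  const { role, action, resource } = req.body;

  // Anything beyond a role check means lookups against your own data store,
  // which is exactly where the latency starts to hurt.
  const allowed = role === "organiser" || (action === "read" && resource === "schedule");

  res.status(allowed ? 200 : 403).json({ allowed });
});

app.listen(4001);
```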
Even if RBAC is all you need, you have to stick to two strict rules to retain sane performance and to avoid locking up your database:
Rule 1: Keep permission checks to session variables and the row’s own columns, with at most one level of indirection (a lookup against a single other table).
Rule 2: Don’t make permissions depend on per-user session variables (like the user ID); use common, group-level identifiers instead.
Rule 1 is going to protect you from the SQL statement size and execution time explosion that happens when a frontend GQL query looks up nested models, each of which requires a nested permissions check. In the ideal case, Hasura can evaluate all the permissions from the session variables, requiring no execution time in your database for permissions (and possibly not even hitting your database at all if you get a logic-layer cache hit). If you need to look up against another table, keep it to one, and ideally use the same lookup across multiple tables so the database can optimise effectively.
Rule 2 will enable Hasura (and other systems) to apply server-side caching. This turned out to be an absolute deal-breaker for our app. If any permission-relevant session variable is unique per user, each user gets a separate cache entry (or none at all if the user ID is relied upon). So, to get scalable performance, using common/group IDs is critical: users looking up the same data will share the same session variable values and thus hit the same cache entry.
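For the sake of illustration, here’s roughly what the difference looks like as Hasura-style row-permission filters. The column and session-variable names are invented:

```typescript
// Cache-friendly: the filter compares a column against a shared, group-level
// session variable. Every attendee of the same conference presents identical
// session values, so they can all hit the same cache entry.
const cacheFriendlySelectPermission = {
  filter: { conference_id: { _eq: "X-Hasura-Conference-Id" } },
};

// Cache-hostile: the filter depends on the individual user (and needs a
// lookup through a relationship), so every user gets, at best, a private
// cache entry - and the database does extra work per row to boot.
const cacheHostileSelectPermission = {
  filter: { registrations: { user_id: { _eq: "X-Hasura-User-Id" } } },
};
```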
Although rules 1 and 2 may seem obvious, they weren’t documented when my colleague and I started out, and Hasura did a great job of marketing its supposed (but sadly, hopelessly underdelivered) realtime and scalability capabilities. We ended up in a nightmare situation and took 5 months to re-engineer the system (made worse by the API surface area).
REST APIs haven’t exactly solved the permissions problem either. In contrast, though, there are well-established, well-understood and, crucially, well-documented approaches to implementing authorization for REST APIs.
Eventually GQL ORMs might be ready to handle this problem. I think it’s probably a matter of time, rather than anything else. But combined with the surface area issue (which isn’t going away), GQL is unviable for anything other than simple apps or large teams.
Caching applies in 3 places: the frontend (client-side caches like Apollo’s or Urql’s), the business logic layer (your GQL middleware, e.g. Hasura), and the database itself.
You might also choose to add custom backend caching such as through Redis. This is not what I’m interested in with respect to GraphQL.
GraphQL defeats caching at each of these levels in a way that severely degrades performance until you put some serious effort into customising/optimising the system.
Off-the-shelf libraries and middleware like Urql, Apollo and Hasura (and Postgres underneath them) claim to do caching out of the box. In practice, and again not helped by the surface area of GQL APIs, they don’t work. Really, honestly, don’t be drawn in by the marketing and online hype. You’re going to have insoluble headaches and spend hundreds of hours on optimisation.
So how does it defeat caching at these levels?
GraphQL doesn’t capture relation-key metadata
For example, the foreign keys used to relate database tables.
This is a killer for frontend caching. Document caching can suffice for small, always-online apps with low uptime guarantees. But for “heavier” apps, normalised caching is an absolute must, and without relation keys normalised caching is largely defeated.
Even worse for our app, we had users all changing pages in sync (at the start/end of conference sessions). Usually, they’d go straight to the schedule page - a super data-heavy page, but one whose data rarely changes and would ideally be persistently cached. Then, after choosing an event, they’d go to that event’s page, which required a subset of the same data - an ideal case for normalised caching.
Urql and Apollo both offer normalised caching plugins, where they try to auto-detect the ID field or let you specify one manually. But they fall short of actually handling foreign keys. So in a significant number of situations, queries will get sent to your backend even if the data is present somewhere in the cache.
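As an illustration of the manual patching this pushes onto you, here’s a sketch using Urql’s graphcache; the Event/Room types and the event_by_pk field are hypothetical:

```typescript
import { cacheExchange } from "@urql/exchange-graphcache";

export const graphcache = cacheExchange({
  // Tell the cache how to identify each entity. It can guess `id`/`_id`,
  // but anything else has to be spelled out by hand.
  keys: {
    Event: (data) => data.id as string,
    Room: (data) => data.id as string,
  },
  // Manually link a lookup-by-id query to an already-cached entity. Without
  // this, the query still goes to the server even when the data is sitting
  // in the cache, because the schema carries no relation-key metadata that
  // would let the cache work it out for itself.
  resolvers: {
    Query: {
      event_by_pk: (_parent, args) => ({ __typename: "Event", id: args.id }),
    },
  },
});
```

You end up writing an entry like that for every lookup field and every relation you care about, and one missed entry quietly turns back into a network request.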
This comes down to the GQL specification for introspection. It offers no way to encode relation-key information (which is a tad broader than foreign keys).
I attempted to write a tool around Hasura to augment the GQL schema with custom metadata. Sadly, Urql was so difficult to work with that I didn’t have time to complete the cache extensions. It was a promising approach though. If you think you have time to pick this up, please let me know.
I had a chat with a member of the GQL maintainer community about submitting a revision to the GraphQL specification to add this option to the metadata. Maybe we’ll get around to it some day.
Caching at the business logic level is prohibitively difficult to implement as permission rules become more complex. Generic solutions fail dramatically.
There are myriad possible approaches to caching at the business logic layer, but most GraphQL middlewares promise a generic solution: a “switch it on and forget it” option.
In practice these don’t work. To be charitable, if you put a lot of time into understanding the middleware’s design (if it’s OSS or has good enough documentation) then you can start to exploit caching for bulk data where permissions involve only organisation-/group-level identifiers (not user ids, subgroup/team ids or similarly ‘detailed’ attributes).
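For example, Hasura’s hosted offering has a @cached directive for exactly this kind of bulk, rarely-changing data - but it only really pays off when the row permissions behind the query resolve from group-level session variables. A hedged sketch (the query and field names are made up):

```typescript
import { gql } from "urql";

// Query-level caching of the "switch it on" variety. If the permissions on
// `events` depend on per-user attributes, each user fragments the cache into
// their own entries and the directive buys you very little.
export const ScheduleQuery = gql`
  query Schedule($conferenceId: uuid!) @cached(ttl: 60) {
    events(where: { conference_id: { _eq: $conferenceId } }) {
      id
      name
      starts_at
    }
  }
`;
```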
If you want something that can do ABAC or Zanzibar-scale authz tuples, look away from GraphQL. Realistically, you will want one of these. Most apps of scale end up needing at least the capabilities of ABAC (at least, the ones that I’ve seen or used).
This might sound like just a limitation in GQL middlewares. But there’s a reason the middlewares share a common weakness. That reason is ‘GQL is too generic’. A GQL API is one big graph with each node having its own permission rules for which it is impossible to implement an efficient cache in all possible cases. Most computer scientists should look at this and immediately see the challenge, and run scared of anyone claiming a generic solution.
There’s another aspect to this problem which hits hard much later on in development, in two ways (you will experience one, the other, or both).
Either you will hit a development wall, where a long and deep review and redesign of the API is necessary to take the next step in scale or complexity of your app (as new features eventually demand evolution of the app’s core).
GQL makes every part of your API intimately connected to every other part, because frontend queries can hit so many nodes at once (even if you’re pretty careful with your queries, which basically just shifts a backend problem onto frontend engineers!)
So when it comes time to add the next big feature, which adjusts just a few nodes in the graph and adds a new permission attribute, a deep and long review, test and inevitable reworking of the app is going to happen, even if you do it piecemeal and pretend you’re working on tech debt 😉
Or, you’re going to wake up one day to a massive failure. Performance of your app will suddenly fall off a cliff, and climbing back up isn’t as easy as rolling back recent changes. Adding a couple of tens of thousands of users can take what was seemingly a great system to a complete disaster.
Your middleware’s cache will hit a limit - probably a capacity limit, but possibly an algorithmic limit. Neither of these is going to be easy to solve because adding capacity to a cache usually adds latency. Adding latency may well tip your system over another performance boundary that you thought was safely cleared.
By this point, you should be getting the impression that the genericism and surface area of GQL and GQL APIs is a big problem for caching. Especially as GQL actively avoids putting necessary information into the schema.
So it should come as no surprise that this is going to do nasty things to your database’s caching mechanisms. Admittedly, databases have been designed over decades to solve exactly this kind of generic caching problem, and the database can exploit its ‘hidden’ knowledge of your data’s structure.
GQL for relational databases is a nightmare. You will inevitably end up with queries full of nested joins, unreadable variables and relatively little obvious re-use of common queries (which the database could exploit if they were better separated). This is what ORM middlewares produce and there’s no way around it. You might be better off with a document store or NoSQL database, but stories from friends don’t bode well. The flexibility of GQL leads to too much variation in the queries hitting your database, and those queries can all too easily become far too large.
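To illustrate, a query like the following (against a hypothetical conference schema) looks harmless on the frontend, but each level of nesting becomes another join or lateral subquery in the generated SQL, each dragging its own permission filter along with it:

```typescript
import { gql } from "urql";

// Innocuous-looking on the client; a monster by the time it reaches Postgres.
export const AttendeeOverview = gql`
  query AttendeeOverview($conferenceId: uuid!) {
    conferences_by_pk(id: $conferenceId) {
      events {
        rooms {
          sessions {
            speakers {
              profile {
                organisation {
                  name
                }
              }
            }
          }
        }
      }
    }
  }
`;
```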
Other problems can also become hard to avoid. Queries that may seem independent might need to hit the same intermediary table, resulting in contention. If a third (independent) query mutates that table, there’s a big risk of unexpected lockup. These issues are extremely difficult to trace (tools like DataDog become an absolute necessity) and hard to predict or spot, especially with frontend engineers working “independently” (without needing full intimate understanding of your backend, right?). This presents a big risk to your production application: a problem you can’t predict and can’t test for might propagate silently into production until, one day, two users show up at the same time and take the whole caboodle down.
Some of these issues can be solved. But why give yourself the stress when REST APIs are better understood, have well-known solutions to most challenges, and stand on proven ground when it comes to scalability?
I’ve already outlined a bunch of performance problems. Here are a few more, just for good measure.
Query size
The point of GQL is you get a graph over which you can make powerful queries that would be tedious to implement as REST endpoints. If all you wanted was CRUD endpoints, you’re definitely better off with just an auto-generated REST API.
But “powerful queries” is doublespeak for “large and complex”.
Large queries are a bad thing. Complex queries are a bad thing. These are fairly well known realities. So GQL is basically asking you to ignore reality and use it anyway.
Permissions
Did I mention that permissions are going to be difficult? Heck yeah. RBAC is fine. If all you need is pure RBAC - I mean, really pure RBAC - you might be okay. If you need an organisation id too, you’re probably still ok.
Oh, you need a team id? Sorry - GQL just exploded the complexity of your permissions exponentially in the number of tables and operations you have.
Debugging / tracing
Haha, as if. Hasura only recently introduced basic tagging of queries, and I promise you’re going to waste many hours on this issue. When you’ve got queries that can hit several parts of your API surface at once, plus the usual challenges of distributed and concurrent/parallel systems, plus the fact that GQL doesn’t come with a structured approach to tracing, you’re in debug hell.
Leaky abstraction
This is the worst offence, in my opinion. Your API should be a boundary between your backend, frontend and everything external. Consumers of the API shouldn’t need to worry if they’re going to make an API call that will take down a part (or all) of the system.
Yet GQL leaks performance across this boundary - really badly. Your entire API is one big graph, so any consumer can hit a wide and disparate set of nodes all at once. Middlewares like Hasura have added features to constrain this mess a little, such as depth limits, limits on the number of returned rows, and one or two other bits. But these don’t stop an API consumer hitting a small set of unrelated endpoints that can put a lot of unexpected strain on your system. Ultimately, such features are sticking plasters on a not-quite-uniquely GraphQL problem.
GraphQL is immature, particularly the tooling around it. Change management on a GQL API is supposedly a solved problem, but the tools aren’t open source. If you’re the size of Facebook, you can probably afford the cost of the development processes and devops around this. If you’re anything smaller, I think it’s a struggle. If you’re a startup, maybe it doesn’t even matter. If you’re a scaleup, it’s a nightmare no-person’s-land.
Hasura, for example, has a built-in migrations system for the database (technically not an API change management system). It’s weak compared to Prisma and suffers from a lack of maturity. It has seriously punishing problems, like failing to keep database migrations caused by metadata changes in sync with migrations created through data model changes. You control the latter; the former only get applied after all the “controlled” migrations. This turns out to be a disaster because event triggers are metadata-driven, not migration-driven. You can easily land in a situation where you remove or modify event triggers (which migrates the Postgresql triggers), then make a data model change (which generates and applies a migration SQL file), and then update the event trigger once more. This works in dev, but later Hasura has no record of the first metadata change: it tries to apply the data model migration and then the final trigger migration, generated fresh from the metadata. The data model migration may well fail (as happened frequently when my colleagues and I were building with it) because the triggers block the changes.
API changes are also difficult to evaluate for impact. Frontend queries may range across large sections of the API in a way that is tricky to track (even with great testing and great tools). Validating even relatively minor API changes becomes a big overhead - which isn’t how GQL is marketed to startups / scaleups.
Not forever, but for now, yes. In the startup I’ve just joined we’ll be using traditional REST, with a declarative database schema and approach to migrations using Prisma. We probably won’t be using Prisma’s middleware - it’s not worth it for us.
Permissions will be ABAC based, with only a small set of attributes to choose from. This will keep things simple and means we can implement JWT plus webhook/single-library-function-based authorization on API endpoints.
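As a minimal sketch of what I mean (the attribute names, route and secret handling here are hypothetical, not our actual implementation):

```typescript
import express from "express";
import jwt from "jsonwebtoken";

type Claims = { sub: string; orgId: string; role: "admin" | "member" };

// One small helper: verify the JWT, then run a single per-endpoint check
// against a fixed, small set of attributes.
function authorize(
  check: (claims: Claims, req: express.Request) => boolean
): express.RequestHandler {
  return (req, res, next) => {
    try {
      const token = (req.headers.authorization ?? "").replace("Bearer ", "");
      const claims = jwt.verify(token, process.env.JWT_SECRET!) as unknown as Claims;
      if (!check(claims, req)) return res.status(403).end();
      next();
    } catch {
      res.status(401).end();
    }
  };
}

const app = express();

// Because the attribute set is tiny, each endpoint's rule stays one line long.
app.get(
  "/orgs/:orgId/schedule",
  authorize((claims, req) => claims.orgId === req.params.orgId),
  (_req, res) => res.json({ events: [] }) // load and return the schedule here
);

app.listen(4000);
```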
Frontend caching will be normalised and easy to generate from the declarative, fully-encompassing Prisma schema. Backend caching will be driven by a traditional combination of local memory caching, Redis-style in-memory database(s) and/or AWS equivalent services. Since we control all the endpoints, we can specify and limit them tightly, and performance/load test them accordingly. We’re relying on AWS RDS services to carry the load on the database and help us scale without re-engineering the scaling systems.
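On the backend side, the pattern is the boring, well-trodden cache-aside one - roughly like this (a sketch only; key names and TTLs are examples, using ioredis):

```typescript
import Redis from "ioredis";

const redis = new Redis(process.env.REDIS_URL ?? "redis://localhost:6379");

// Cache-aside: check Redis, fall back to the loader, store the result with a
// TTL. Because we control each REST endpoint, we decide per endpoint what is
// cacheable and for how long - something a generic GQL cache has to guess at.
async function cached<T>(key: string, ttlSeconds: number, load: () => Promise<T>): Promise<T> {
  const hit = await redis.get(key);
  if (hit) return JSON.parse(hit) as T;

  const value = await load();
  await redis.set(key, JSON.stringify(value), "EX", ttlSeconds);
  return value;
}

// e.g. the conference schedule barely changes, so cache it for a minute:
// const schedule = await cached(`schedule:${conferenceId}`, 60, () => loadSchedule(conferenceId));
```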
Change management with Prisma is a bit easier to handle. We’ll follow the recommended two-phase approach to changes: e.g. to replace a column, first add the new column and migrate code and data over to it, then remove the old column in a later release once nothing references it.
Really, this is all traditional REST stuff. We’re not doing anything new or special on this aspect, and that’s the whole point.
GraphQL is too new and too shiny for real-world use in anything smaller than a very large enterprise. To be honest, given its current (extreme) drawbacks, I’m not sure it ever will be ready for smaller organisations. I will be happy to watch and maybe to be proven wrong. It’s certainly a neat and powerful framework.