SmackerNews

OpenAI: Scaling PostgreSQL to the Next Level

190 points · 139 comments · 10 months ago · thunderbong

pixelstech.net

samwillis 10 months ago
I was at PGConf last week, and this was one of the most packed talks - a great insight into using Postgres, where most of the conference was fairly inward facing, with talks around the development of Postgres itself (pgconf.dev is very much that one, out of all the others each year).
What you have to remember is that for many teams, when their product takes off, they are not equipped with the deep internal knowledge of how to scale a particular part of their stack. This was an awesome story from a small team having to tackle those challenges, and how they were learning as they went. So, while there are some of those "can't you just" and "what's interesting about this?" comments here, with the narrative of the growth rate and the very high profile of the product, it was the perfect user talk for an internal-development-focused conference.
The key insight, and main message, of the talk was that if you are not too write heavy you can scale Postgres to very high read throughput with read replicas and only a single master! That is exactly the message that needs to be spelled out as that covers the vast majority of apps.
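For anyone wondering what that looks like on the application side, here is a minimal sketch of read/write splitting. It is not from the talk: the `ReadWriteRouter` class and the target names are hypothetical, and the "starts with SELECT" check is a deliberately naive assumption.

```python
import itertools


class ReadWriteRouter:
    """Pick a connection target: one primary for writes, replicas for reads."""

    def __init__(self, primary, replicas):
        self.primary = primary
        # Round-robin over replicas; None means no replicas are configured.
        self._replicas = itertools.cycle(replicas) if replicas else None

    def target_for(self, sql):
        words = sql.split()
        verb = words[0].upper() if words else ""
        # Only plain SELECTs go to replicas. Everything else, including
        # statements this naive check cannot classify, goes to the primary.
        if verb == "SELECT" and self._replicas is not None:
            return next(self._replicas)
        return self.primary


router = ReadWriteRouter("primary", ["replica-1", "replica-2"])
print(router.target_for("SELECT * FROM users WHERE id = 1"))
print(router.target_for("UPDATE users SET name = 'x' WHERE id = 1"))
```

A real router would also need to send `SELECT ... FOR UPDATE`, selects that call writing functions, and reads that must see their own recent writes to the primary, since replicas can lag behind it.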
As an observation, in the Q&A at the end of the talk the questions, primarily from core Postgres developers, were focused on learning about the use case, and not an opportunity to suggest that they were doing anything wrong (not quite the same as this thread could get). A genuinely awesome group of very friendly and welcoming people in the Postgres community.
VWWHFSfQ 10 months ago
> The presentation also specifically mentioned that using ORM can easily lead to inefficient queries and should be used cautiously.
Every ORM is bad. Especially the "any DB" ORMs. Because they trick you into thinking about your data patterns in terms of writing application code, instead of writing code for the database. And most of the time their features and APIs are abstracted in a way that basically means you can only use the least-common-denominator features of all the database backends that they can support.
I've sworn off ORMs entirely. My application is a Postgres application first and foremost. I use PG-specific features extensively. Why would I sacrifice all the power that Postgres offers me just for some conveniences in Python, or Ruby, or whatever?
Nah. Just write good SQL for your database and the whole system will be very happy.
tux3 10 months ago
Self-hosting Postgres is appealing from a flexibility standpoint (you don't get locked out of superuser or advanced features), but it sounds a little bit nerve-racking.
I'm now hoping all the cloud providers read this article and start exposing a way to try disabling an index in the query planner before you drop it for real. That should really become standard procedure, just in case.
But if you're a large scale company to the point of wanting to own and customize your stack, it can definitely make sense to self-host.
cryptonector 10 months ago
> Feature Requests
> Concerning schema changes: they desire PostgreSQL to record a history of schema change events, such as adding or removing columns and other DDL operations.
You can do this right now today by using `EVENT TRIGGER`s. You can check out things like Aquameta[0] (if I remember correctly) to see how it's done.
[0] https://github.com/aquametalabs/aquameta
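A rough sketch of the idea, written from memory rather than taken from Aquameta: an event trigger on `ddl_command_end` that copies what `pg_event_trigger_ddl_commands()` reports into a history table. The names `ddl_history`, `log_ddl`, `track_ddl`, and `install_ddl_history` are made up for illustration, and the Python wrapper assumes any DB-API style connection (for example psycopg).

```python
# Assumed names: ddl_history, log_ddl, track_ddl. Requires PostgreSQL 11+
# for EXECUTE FUNCTION, and a role allowed to create event triggers.
DDL_HISTORY_SETUP = [
    """
    CREATE TABLE IF NOT EXISTS ddl_history (
        id              bigserial PRIMARY KEY,
        executed_at     timestamptz NOT NULL DEFAULT now(),
        username        text NOT NULL DEFAULT current_user,
        command_tag     text,
        object_type     text,
        object_identity text
    )
    """,
    """
    CREATE OR REPLACE FUNCTION log_ddl() RETURNS event_trigger
    LANGUAGE plpgsql AS $$
    DECLARE
        cmd record;
    BEGIN
        -- One row per DDL command reported for the statement that just ended.
        FOR cmd IN SELECT * FROM pg_event_trigger_ddl_commands() LOOP
            INSERT INTO ddl_history (command_tag, object_type, object_identity)
            VALUES (cmd.command_tag, cmd.object_type, cmd.object_identity);
        END LOOP;
    END;
    $$
    """,
    """
    CREATE EVENT TRIGGER track_ddl ON ddl_command_end
        EXECUTE FUNCTION log_ddl()
    """,
]


def install_ddl_history(conn):
    """Run the setup statements in order on a DB-API connection, then commit."""
    cur = conn.cursor()
    try:
        for statement in DDL_HISTORY_SETUP:
            cur.execute(statement)
    finally:
        cur.close()
    conn.commit()
```

As far as I know, dropped objects are reported through the separate `sql_drop` event and `pg_event_trigger_dropped_objects()`, so a complete history would need a second trigger for those.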
venriq 10 months ago
I don't deny how powerful PostgreSQL is, but it appears that the decision to choose PostgreSQL was the most significant issue all along. It's shocking how often we find teams selecting Postgres to implement a solution that it's not suitable for. Yet, you see a document with a forced and non-logical narrative trying to justify the decision.
With the NewSQL options available today, which provide distributed relational databases with multi-masters out of the box, it seems to me that many teams select Postgres simply because that's all they know, and that's the source of the problem.
williamdclt 10 months ago
Not super interesting, this is fairly basic stuff that you'll encounter at orders of magnitude smaller scale than OpenAI. Creating indexes CONCURRENTLY, avoiding table rewrites, smoothing out traffic, tx timeouts, read replicas... It's pretty much table stakes, even at 10000x smaller scale.
Their requests to Postgres devs aren't anything new either, everyone has wished for it for years.
The title is kind of misleading: they're not scaling it to the "next level", they're clearly struggling with this single-master setup and trying to keep it afloat while migrating off ("no new workloads allowed"). The main "next scale" point is that they say they can "scale gracefully under massive read loads" - nothing new, that's the whole point of read replicas and horizontal scaling.
Re: "Lao Feng Q&A":
> PostgreSQL actually does have a feature to disable indexes. You can simply set the indisvalid field to false in the pg_index system catalog [...] It’s not black magic.
No. It's not documented for this use, so it's not a feature. It's fooling around with internals without guarantees of what this will do (it might do what you want today, it might not in the next release). Plus as they point out, managed Postgres providers don't let you fiddle with this stuff (for good reasons, as this is not a feature).
> there’s a simpler solution [to avoiding accidental deletion of used indexes]: just confirm via monitoring views that the index is not being used on either primary or replicas
That doesn't quite solve all the same problems. It's quite frequent that an index is in use but is not _needed_: another index would also work (e.g. you introduced a new index covering an extra column that's not used in this query). Being able to disable an index would allow checking that the query plan does use the other index, rather than praying and hoping.
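For reference, the monitoring-view check from the quote can be sketched like this, assuming you run the query on every node (usage counters are tracked per server) and collect the results into a dict. The `unused_everywhere` helper and the node names are hypothetical.

```python
# Query to run on the primary and on each replica; pg_stat_user_indexes
# exposes idx_scan, the number of scans started on each index.
INDEX_USAGE_SQL = """
SELECT schemaname || '.' || indexrelname AS index_name, idx_scan
FROM pg_stat_user_indexes
"""


def unused_everywhere(scans_by_node):
    """Return indexes whose idx_scan is zero on every node.

    scans_by_node maps a node name to {index_name: idx_scan}.
    """
    all_indexes = set()
    for scans in scans_by_node.values():
        all_indexes.update(scans)
    return sorted(
        name
        for name in all_indexes
        if all(scans.get(name, 0) == 0 for scans in scans_by_node.values())
    )


stats = {
    "primary": {"public.users_email_idx": 0, "public.users_org_idx": 12},
    "replica-1": {"public.users_email_idx": 0, "public.users_org_idx": 0},
}
print(unused_everywhere(stats))
```

This only finds indexes that are never scanned anywhere. It says nothing about the case above, where an index is scanned but another index could serve the same queries.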
davidkuennen 10 months ago
They seem to be using physical replication. I'm currently thinking of switching to logical replication to reduce inter region egress cost.
Do you think that's a good idea? There seem to be many improvements to native logical replication since Postgres 17.
vb-8448 10 months ago
I wonder how much their performance can improve if they put the write-instance on dedicated servers (with local and very fast ssd) and use managed services only for read-replicas.
redwood 10 months ago
What I find odd about this is there's no mention of all the other engines that must be in the mix powering different types of queries: I have no doubt they're using a little of everything, from scaling key-value to search, vector search, caches... They must be doing somersaults to avoid over-saturating this over-saturated Postgres env... yet only Postgres is discussed here.
airstrike 10 months ago
Title should really be "Scaling PostgreSQL to the Next Level at OpenAI", which is the actual title of the talk
bhouston 10 months ago
Argh. Shard the damn database already.
Why are they not sharding by user/org yet? It is so simple and would fix the primary issue they are running into.
All these workarounds they go through to avoid a straightforward fix.
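The routing part really is small. A minimal sketch, assuming a fixed number of shards and an org id available on every request; `shard_for`, `shard_dsn`, and the DSN strings are made-up names, not anything from the talk:

```python
import hashlib


def shard_for(org_id: str, num_shards: int) -> int:
    """Map an org id to a shard index in [0, num_shards) with a stable hash."""
    digest = hashlib.sha256(org_id.encode("utf-8")).digest()
    # Python's built-in hash() is randomized per process, so use a real digest.
    return int.from_bytes(digest[:8], "big") % num_shards


def shard_dsn(org_id: str, dsns: list) -> str:
    """Pick the connection string of the shard that owns this org."""
    return dsns[shard_for(org_id, len(dsns))]


dsns = ["postgres://shard0/app", "postgres://shard1/app", "postgres://shard2/app"]
print(shard_dsn("org_1234", dsns))
```

The catch with plain modulo is that changing the shard count remaps most orgs, so in practice you would probably want a lookup table or consistent hashing, plus a plan for moving existing data and for queries that span orgs.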
harisund1990 10 months ago
Seems like a perfect use case for a distributed database like YugabyteDB. Have you looked into it?
kabes 10 months ago
"He finally concluded with some requests to the postgres developer community"
... You're one of the most well-funded companies in the world. You shouldn't be asking open source devs for features; you should be opening PRs.
philosophty 10 months ago
OpenAI and companies like it hire inexperienced people with zero operational experience, and this is how they run things. It would almost be funny if you didn't see how unreliable the end result was.
Postgres is powerful but just not suited for this role. But if your only tool is a hammer...
belter 10 months ago
Many of these scaling issues would be solved if they simply used AWS and Aurora Postgres.
https://pages.cs.wisc.edu/~yxy/cs764-f20/papers/aurora-sigmo...
https://youtu.be/pUqVCK7Ggh0

news.ycombinator.com/item?id=44071418