REINDEX CONCURRENTLY made a huge difference for us.
We have a main database that started at version 9.6 and was upgraded along the way. The largest table is huge (billions of rows, TBs of disk space) and gets a lot of deletes and updates.
Vacuums could no longer finish on that table (we killed it after ~90 days).
Reindexing (plus a vacuum that skips indexes) dropped our DB load from 60 to 20 and fixed the autovacuums, which now take less than a day. The indexes on that table had accumulated a lot of bloat, and I think newer versions also improved the on-disk index layout.
We now have a monthly cron job to reindex all indexes.
It’ll probably shave 50% off that table’s disk space.
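(For anyone curious, the monthly job boils down to statements like the following; the table and index names are made up, and REINDEX ... CONCURRENTLY needs Postgres 12 or newer.)

    -- Rebuild one bloated index without blocking writes (PG 12+)
    REINDEX INDEX CONCURRENTLY big_table_customer_id_idx;

    -- Or rebuild every index on the table in one pass
    REINDEX TABLE CONCURRENTLY big_table;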
However, we're actually in the process of sharding the db, and by copying data over customer by customer we'll shed the bloat along the way. The resulting shards will be a lot more manageable, so we can run pg_repack there with more confidence.
I think I would recommend setting log_min_duration_statement first and watching the logs for some time before doing that, so that you know what's going to get whacked, have some opportunity to tune it, etc.
Edit: It is mentioned, so perhaps just discuss it before talking about setting the timeout.
Good catch, will definitely update or perhaps even re-order. I'm not sure either in isolation gives you everything you need, but point noted that some sense of what is getting cancelled is equally important.
I think there's a case to be made that a timeout is more important. With an OLTP workload, it's not hard to imagine a runaway query running for days, consuming I/O bandwidth and silently slowing everything down. A timeout may break some things, but it will do so a lot more clearly. (Of course, both settings are a good idea.)
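For reference, both can be set cluster-wide without a restart; the values here are only illustrative, not recommendations:

    -- Log any statement that runs longer than one second
    ALTER SYSTEM SET log_min_duration_statement = '1s';

    -- Cancel any statement that runs longer than 30 seconds
    -- (the docs suggest setting this per role/session rather than globally)
    ALTER SYSTEM SET statement_timeout = '30s';

    -- Reload the configuration so the new values take effect
    SELECT pg_reload_conf();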
How often would a runaway query be able to run for days without the connection being dropped? I would expect either the incoming client request or the background job that started the query to have their own timeouts, and to dispose of their connections/transactions when killed. If a request or job does last for hours, it's probably easier to notice compared to a query, due to generally better observability tools.
The only time I've seen a query running for days in prod was when a human database toucher had forgotten about their open DBeaver tabs after running a complex query with too few filters.
Due to some Postgres limitations, a query can keep running even after the connection drops (the backend process only checks whether there's still a client to send results to after query completion). I think I saw some work being done on this in 14 or the upcoming 15 (i.e., re-checking periodically during query processing), but a lot of Postgres versions are still affected. I worked on the Heroku Data team and we regularly saw queries running for days. I think starting with an aggressive timeout to fail fast where you can't afford a slow query anyway, and then making exceptions where it makes sense, is easier to work with than tracking down problem cases and putting in limits after they cause issues.
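One way to carve out those exceptions is per role; the role names below are hypothetical:

    -- Aggressive default for the application's role
    ALTER ROLE app_user SET statement_timeout = '30s';

    -- A reporting role that is explicitly allowed to run long queries
    ALTER ROLE reporting_user SET statement_timeout = '15min';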
Weekly VACUUM ANALYZE on some of the busiest write tables made me go from sleepless nights to bliss.
And every 6 months we noticed certain queries getting slower and slower. That's because the data growth means we need a different approach to indexing or to how we access it. So our queries change a few times a year.
Otherwise, on a 650 GB database, there is remarkably little maintenance needed, except testing restores of daily backups and testing replicas.
The bugginess of Oracle surprised me. I expected complicated (but underestimated how complicated they make things), but what I did not expect was the bugs.
After having worked with PostgreSQL, MySQL and Oracle I cannot understand why anyone would pick Oracle. Its advantages can't be worth the hassle even if we ignore the license fee.
The biggest Postgres-related lesson I learnt recently is the importance of running ANALYZE and having up-to-date table statistics. In our case it was the difference between queries taking several seconds to run and taking <100ms.
We get much higher traffic during the daytime, so we now have a cronjob that runs VACUUM ANALYZE each night.
We also have some small metadata tables that are very update heavy as we sync these from another data store, replacing all rows each time. We now run VACUUM FULL on these after each sync (this locks the table but is fast (20-40ms) on such small tables) to avoid bloating them over time.
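Concretely, those two jobs boil down to something like the following (table names made up):

    -- Nightly cron: reclaim dead tuples and refresh statistics on the busy tables
    VACUUM (ANALYZE) orders, order_items;

    -- After each metadata sync: fully rewrite the tiny, update-heavy table
    -- (takes an exclusive lock, but only for a few tens of milliseconds here)
    VACUUM FULL sync_metadata;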
It seems to me that there's something major missing from Postgres if you need to manually run something daily or else the queries can go from 100ms to "several seconds."
If this is such a standard requirement and the statistics are so vital to performance, why isn't something built-in to the engine to keep these up-to-date without a user intervention?
I don't have to run "ANALYZE" on DynamoDB daily to ensure performance doesn't tank.
This is usually done automatically in Postgres in the background by the autovacuum process. For high volumes of traffic or certain usage patterns you might have to tune the autovacuum settings, since it might not be able to keep up with the more conservative default settings.
One issue is that there is some counter-intuitive behaviour here: if you see autovacuum taking significant resources, the worst thing you can do is let it run less often. You actually need to make it more aggressive in that kind of situation, and/or fix your usage pattern or add more resources.
If a manual ANALYZE is necessary, this can often indicate a misconfiguration of Postgres, e.g. someone reducing the Autovacuum frequency or disabling it entirely. Postgres also got a lot better at this, so it also matters how old your Postgres version is.
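To give an idea of what "more aggressive" looks like, here's a sketch with illustrative numbers (not recommendations) for a large, churn-heavy table:

    -- Default scale factor is 0.2, i.e. 20% of the table must change before
    -- autovacuum kicks in - far too lazy for a table with billions of rows
    ALTER TABLE big_table SET (
        autovacuum_vacuum_scale_factor = 0.01,
        autovacuum_analyze_scale_factor = 0.005
    );

    -- Allow autovacuum to do more work per cycle, cluster-wide
    ALTER SYSTEM SET autovacuum_vacuum_cost_limit = 2000;
    SELECT pg_reload_conf();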
Agreed, I never faced these issues as our database is relatively small, but it's weird to see so many manual commands used for maintenance when these could be built-in, ongoing PG processes.
It is built-in. The autovacuum process does this. When you have a large database you may need to tune some config parameters, as the conservative defaults, aimed at small/mid-sized databases, are naturally different.
Running these manual commands is just shifting the config tuning for large databases to periodic "manual" operations.
+1 to both of these approaches. Maybe once per year we would have a query plan regress because of out-of-date column statistics. This happened in some pathological cases like tables that had old data evicted frequently. Once we committed to running VACUUM ANALYZE nightly on a cron, we never saw query plans regress out of the blue again.
ANALYZE is the first thing I run when looking at a problematic slow query; if you aren't aware of it, you could spend hours trying to figure out why the planner was doing A when it should be doing B.
It should, but it may need to be tuned to run aggressively enough, which can be tricky. Sometimes it's easier to just schedule a manual job that runs during your off hours. There are some interesting discussions on the Postgres development list on improving VACUUM performance [1] and other improvements to the overall MVCC mechanism to avoid the need for VACUUM in the first place if possible [2]. Postgres, like any complex system, does have pathological cases. It helps to keep an eye on performance and learn the system as your database grows. But it's good to see the community is thinking about how to improve the situation.
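A quick way to check whether autovacuum is keeping up is to look at when it last touched your busiest tables:

    -- Last (auto)vacuum / (auto)analyze per table, plus accumulated dead tuples
    SELECT relname,
           n_dead_tup,
           last_vacuum,
           last_autovacuum,
           last_analyze,
           last_autoanalyze
    FROM pg_stat_user_tables
    ORDER BY n_dead_tup DESC
    LIMIT 20;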
> If you see idle is above 20 it's recommended to explore using PgBouncer.
Exploring pgbouncer when you have lots of idle connections is a great tip, but 20 idle connections feels _extremely_ low to me. I've seen postgres databases on AWS Aurora serving over 13,000 transactions per second with hundreds of idle connections (because of client side pooling with a few dozen backend clients) just fine. In fact, around that scale is when we switched _from_ pgbouncer to client side pooling to simplify our architecture, and we noticed no degradation in any major metrics.
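For anyone wondering where their own number sits, the state breakdown is easy to pull from pg_stat_activity:

    -- Count current connections by state (active, idle, idle in transaction, ...)
    SELECT state, count(*)
    FROM pg_stat_activity
    WHERE backend_type = 'client backend'
    GROUP BY state
    ORDER BY count(*) DESC;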
Isn't AWS Aurora PostgreSQL a completely different (but compatible) product from PostgreSQL? If so, it would make sense that its ability to handle many connections is just an implementation detail.
Aurora forked PostgreSQL very early on (before 9.6) and has been developed independently since, with a lot of changes. They are effectively different databases now.
They changed the storage layer and parts of the query planner to work correctly with their new storage layer. So it's a fork and doesn't behave exactly identically. I have seen different execution plans for the same model and data because of the implications of the changed storage layer and query optimizer.
Do you know of any credible source for Amazon forking postgresql? I did some googling and can certainly find lots of people saying it, but nothing authoritative from Amazon themselves.
AWS basically never shares any details, only when it's beneficial in terms of marketing. The changes they made would not have been possible with an extension in the past, and they are most probably still not fully possible without forking. But that's not a problem; they are free to fork and merge upstream changes.
For MySQL, the big users like Facebook & co. are running (heavily) forked MySQL versions with the changes they need.
Nope. There is an AWS blog post [0] from September 2021 about setting up pgbouncer in front of Aurora Postgres, and that blog post references the AWS RDS Proxy service [1], but Aurora doesn't have pgbouncer in front by default. For what it's worth, we also handled hundreds of idle connections just fine on vanilla RDS postgres.
I would have said: check that your shared_buffers memory and concurrency settings are configured right for the environment you are running on. The defaults are wrong for most production environments, and the right values are highly server-dependent. If you haven't specifically tuned them for your production server, chances are they are wrong and you are leaving substantial memory and/or concurrency on the table.
I don't have links right now, but if you search for something like "online postgres config tool", you should find two different pages that do the same thing - you enter some data about your environment, and it spits out recommended settings.
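The output of those tools tends to look roughly like this; the numbers are just an illustration for a dedicated 16 GB box, not advice for your hardware:

    ALTER SYSTEM SET shared_buffers = '4GB';           -- ~25% of RAM is a common rule of thumb
    ALTER SYSTEM SET effective_cache_size = '12GB';    -- what the planner assumes the OS is caching
    ALTER SYSTEM SET work_mem = '32MB';                -- per sort/hash node, per connection
    ALTER SYSTEM SET max_worker_processes = 8;
    ALTER SYSTEM SET max_parallel_workers_per_gather = 4;
    -- shared_buffers and max_worker_processes need a restart; the rest apply on reload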
Ran into a pathological case where a query that usually took 10-20ms would sometimes, semi-randomly, take >15 minutes (while the corresponding server-side PG process ran at 100% CPU).
EXPLAIN showed a dramatic query plan change, which led us towards bad table statistics.
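If you suspect the statistics, you can inspect what the planner currently believes about the relevant column and then refresh it; table and column names here are hypothetical:

    -- The planner's current view of the column behind the bad estimate
    SELECT n_distinct, null_frac, most_common_vals
    FROM pg_stats
    WHERE tablename = 'events' AND attname = 'account_id';

    -- Rebuild statistics for just that table, then re-run EXPLAIN
    ANALYZE events;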
There seems to be a common lifecycle of indexes within applications. First you start off with almost none, maybe a few on primary keys.
Not to be rude or anything but I hope he doesn't mean this.
Every single postgres primary key (and unique constraint) automatically gets an index. That's how the unique constraint is implemented. Primary keys being naturally unique.
I would agree IFF there wasn't the 'maybe'. If that word wasn't there I would read it as having not many tables in the beginning and thus not many primary keys and thus not many primary key indices.
I don't disagree, but Mr. Kerstiens has quite the track record of Postgres excellence, so he gets every benefit of the doubt from me. I feel quite certain he understands how primary keys work! I'm sure this was just a miswording.
Fair enough. I don't know him though (never heard the name), and after skimming the first paragraphs of each item and reading this I closed the window instantly, as for me it invalidated the information where I had no in-depth knowledge myself. As in, 'how can I trust any of the rest if something so fundamental is off?'
Very unfortunate if what you say is true. I guess I'll give him the benefit of the doubt then and go read the rest.
I don't know the author, but when I read an article by someone who obviously knows something about Postgres (knowing something about Postgres myself), I feel it's much more likely that they made a slight language error.
Some schema-changing statements drop the statistics.
We learned this the hard way. We altered some integer columns to bigint. That cleared the statistics for those columns and caused terrible query plans. An ANALYZE fixes this, but it took us a few days to notice.
Is there any documentation on which statements do this? I've seen this in the past too, but can't remember which one was responsible for it and whether it changed in recent versions.
I haven't found any. In our case, we did an alter table .. alter column, which I assumed would be fine. In retrospect, it does make sense that PG recreates that column from scratch and thus doesn't keep the statistics.
Since then we typically include an ANALYZE statement whenever we make a large change or rewrite a lot of rows.
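In migration terms that pattern is simply (hypothetical table name):

    -- The type change rewrites the column and drops its statistics
    ALTER TABLE payments ALTER COLUMN amount_cents TYPE bigint;

    -- Rebuild statistics right away so the planner isn't flying blind
    ANALYZE payments;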
I would love to drop MySQL/MariaDB and go with PostgreSQL, but information like this makes me nervous that I would set up a footgun and be constantly debugging things because I am not a DB admin expert.
Anyone care to comment on how PostgreSQL works out of the box without being an expert?
(I've run MySQL/MariaDB for almost 20 years and there are very few issues I've been surprised by)
I converted a large MySQL DB to PostgreSQL in 2004 and never looked back. Think about how many improvements have occurred in Pg releases since that time and the availability of information. It took my team about six months to do the conversion as there were any number of type conversion problems due to programming assumptions and loose typing in MySQL at that time. There's also an excellent support community on IRC at Freenode (you can google it) and they are very helpful to people of any level of experience.
Optimizations as presented by many are not needed right up front and you'll learn them as you go along. Good luck!
I’m definitely not an expert and I haven’t found running postgres any more difficult than mysql. There are some differences that take a bit of getting used to, though.
There are reindexdb and vacuumdb in the bin directory of a PostgreSQL installation. They cover 99% of maintenance use cases when run from cron, mainly those mentioned in the comments. For a general-purpose workload I find the default settings reasonable.
How would I know when to run these? Or if I need them?
I know I can do research, get training, etc... but I have never done maintenance like this with MySQL/MariaDB, so it's the unknown unknowns that worry me.
Like a lot of other complex systems, there's a lot of tweaking you can do in the config, but generally postgres works out of the box pretty well. There's a few gotchas on setup (postgres has a very conservative default configuration for things like memory usage and parallelism), but it's still pretty good.
I'd say the biggest footgun is really not knowing about the process-per-connection limitation, which is why the article mentions PgBouncer. Everything else in the article is pretty geared towards setting your db up for monitoring so you can head off issues.
> I would love to drop MySQL/MariaDB and go with PostgreSQL, but information like this makes me nervous that I would setup a footgun and be constantly debugging things because I am not a DB admin expert.
Has MySQL gotten... a lot better, or are you just used to its quirks? I feel like the UTF encoding and ... date format? timezones? are/were all really big footguns.
The encoding stinks, but it seems "fixed". (utf8mb4) I just migrated a few hundred databases between servers with the old dbs not having proper/modern encoding, and it was a bit messy to clean up.
Date format... not sure, I just use datetime now, timezones I manage in app (never trusted db engine for that...) Maybe I avoided those by accident with unix time until I moved to datetime.
I kind of wish the title was "Healthier and Happier".
Despite this calamitous oversight (wink), articles such as this are a great source to draw on others' experience and reap the benefits of others' hindsight without experiencing the outages or time-consuming problems yourself.
It would be awkward if pgbouncer were "included" in postgres as the topology of bouncers you have might involve multiple layers running on different computers to minimize things like connection setup latency.
As someone who's used MySQL for 17 years and dabbled a little in Mongo, is there any reason to try and switch to Postgres?
Obviously I love learning new things, but I've just never felt inclined to try it out, whereas I'm always jumping between other languages and frameworks within them. Not sure why that is.
I always get flamed when I mention this (I don't mind it, btw) but...
I find it pretty interesting that for the DB write-intensive portion[0] of the TechEmpower web framework benchmarks, you must go further than entry 150 (sorted by most requests-per-second) to find MySQL used, compared to PostgreSQL or Mongo. For the DB read-intensive portion[1], only 15 of the top 100 use MySQL (the rest in that bracket use PostgreSQL), and the first entry there comes in at number 44. Important: if you examine the entries, you'll find that some of the frameworks have multiple listings for different configurations, including the use of MySQL vs PostgreSQL (in other words: same framework, but only the DB is different, in the same test).
IMO there has to be a solid business case to switch a database layer. If MySQL serves your business well, I’d not switch. I like MySQL and PostgreSQL. I’ve used both. I felt the decision to go one way or the other depended more on how the company could maintain it after all decision makers were no longer there. Where MySQL was chosen, getting tooling and support from Oracle or Percona was seen as the biggest benefit. Where PostgreSQL was chosen, quality optimizer and open source was seen as the biggest benefit.
I'd highly recommend trying PostgreSQL yourself and learning its config, permissions, replication, administration and querying capabilities. You'll appreciate PostgreSQL's casting with :: for example. If you had to start from scratch, you'll be better informed about PostgreSQL and can include it in your selection process.
When we used MySQL, we loved the ease of replication and tooling such as Monyog/Webyog/Workbench. When we used PostgreSQL, we loved the query optimizer and JSON functions.
Killer feature for me was transactional schema changes. We use Flyway for db migrations, which works fine, but a failed migration is left in a half-completed state in MySQL - Postgres can execute the migration in a transaction that rolls back on failure.
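Because DDL in Postgres is transactional, a migration can look like this (table names made up); if anything fails before the COMMIT, the whole thing rolls back:

    BEGIN;

    -- Both changes land together or not at all
    ALTER TABLE accounts ADD COLUMN plan_id bigint;
    CREATE INDEX accounts_plan_id_idx ON accounts (plan_id);

    COMMIT;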
I haven't had a chance to use mysql 8 yet (which I believe narrows the gap between postgres and mysql), but whenever I'm interacting with databases on mysql 5.7 I miss features like common table expressions from postgres
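For anyone still on 5.7 who hasn't used them, a CTE is essentially a named subquery you can reference (and chain) by name; a made-up example:

    -- Latest order per customer, then keep only the large ones
    WITH latest_orders AS (
        SELECT DISTINCT ON (customer_id) customer_id, id, total_cents
        FROM orders
        ORDER BY customer_id, created_at DESC
    )
    SELECT customer_id, total_cents
    FROM latest_orders
    WHERE total_cents > 10000;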