Last week I had the opportunity to discuss performance cliffs with the Malmö PUG. The meeting was a great success, with a cozy atmosphere and an engaged audience. I highly recommend attending or even giving a talk there.
After the meetup, I realized it's been a while since I posted about a patch idea, and the performance cliff discussion had some good candidates for improvement. One of the examples that caught my attention was queries with IN clauses. It's a well-defined and isolated problem, making it a great candidate for our first patch.
The problem with IN queries is that without an index on the queried column, Postgres falls back to a sequential scan and has to perform a membership check against the IN list for every row. There are several ways to implement that check, such as a naive linear search through the list, sorting the list and using binary search, or building a hash table, each trading setup cost against per-row cost. The current implementation uses a linear search for short lists and builds a hash table for longer ones.
The issue lies in choosing the threshold between short and long lists. The commit that introduced hash tables for long lists hard-codes this threshold, and it's not clear where exactly to draw the line. During the talk I showed an example where shortening the list doubles the query duration, which seems counterintuitive at first. The explanation: removing elements pushes the list below the threshold, the executor switches from a hash table lookup back to a linear search, and each per-row check becomes more expensive.
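To make this concrete, here's a minimal sketch of the shape of the decision. The function names and the cutoff constant are illustrative, not copied from the PostgreSQL source:

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative cutoff; the real constant in PostgreSQL is also hard-coded. */
#define HASHED_SAOP_MIN_LIST_SIZE 9

/* O(n) membership check: cheap for tiny lists, but paid for every row. */
static bool in_list_linear(int value, const int *list, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (list[i] == value)
            return true;
    return false;
}

/*
 * The strategy is fixed at plan time from the list length alone. Dropping
 * the list just below the cutoff flips every per-row check from an O(1)
 * hash probe back to the linear loop above: the cliff from the talk.
 */
static bool use_hashed_lookup(size_t list_length)
{
    return list_length >= HASHED_SAOP_MIN_LIST_SIZE;
}
```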
To address this, we need a cheap way to collect accurate cost data for each strategy, which may require thinking outside the box and experimenting with different methods. One approach could be to measure the time taken by individual checks using RDTSC or HPET, sampling only a fraction of the checks to keep the overhead down, while still making sure the measurements stay reliable.
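As a sketch of what sampled measurement could look like, the snippet below wraps a membership check and reads the x86 TSC for one check out of every `SAMPLE_EVERY`. All names here are hypothetical, and `__rdtsc()` requires an x86 compiler providing `<x86intrin.h>`:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <x86intrin.h>          /* __rdtsc(), x86 only */

#define SAMPLE_EVERY 128        /* time only 1 in 128 checks to bound overhead */

typedef bool (*check_fn)(int value, const int *list, size_t n);

static uint64_t total_cycles;   /* accumulated cost of the sampled checks */
static uint64_t sample_count;   /* how many checks were actually timed */
static uint64_t check_count;    /* all checks, timed or not */

/* Run one membership check, timing a sampled subset with the TSC. */
static bool sampled_check(check_fn check, int value, const int *list, size_t n)
{
    if (check_count++ % SAMPLE_EVERY != 0)
        return check(value, list, n);   /* untimed fast path */

    uint64_t start = __rdtsc();
    bool found = check(value, list, n);
    total_cycles += __rdtsc() - start;
    sample_count++;
    return found;
}
```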
One possible implementation strategy would be to stick with the decision made by `convert_saop_to_hashed_saop` during query planning for a certain number of rows (say, 1000?), measuring the cost of each check along the way, and then switch to the other strategy to estimate its performance. After settling on the cheaper one, we can re-verify the decision periodically to account for any data-dependent costs.
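In the same hypothetical terms, the adaptive loop might look something like this; the row counts, the re-check interval, and every name below are assumptions, not a worked-out design:

```c
#include <stdint.h>

typedef enum { STRAT_LINEAR, STRAT_HASH } Strategy;

typedef struct
{
    Strategy strategy;         /* current choice, seeded by the planner */
    uint64_t rows_seen;
    double   avg_cycles[2];    /* running per-check cost estimate per strategy */
    uint64_t samples[2];       /* measurements feeding each estimate */
} AdaptiveState;

#define TRIAL_ROWS    1000     /* rows to run each strategy before comparing */
#define RECHECK_ROWS  100000   /* revisit the decision this often */

/* Feed one measured check cost into the state and adjust the strategy. */
static void adapt(AdaptiveState *st, uint64_t cycles)
{
    Strategy s = st->strategy;

    /* Running average of the per-check cost for the active strategy. */
    st->samples[s]++;
    st->avg_cycles[s] += ((double) cycles - st->avg_cycles[s]) / st->samples[s];
    st->rows_seen++;

    /* After the first trial window, switch strategies to price the other one. */
    if (st->rows_seen == TRIAL_ROWS)
        st->strategy = (s == STRAT_LINEAR) ? STRAT_HASH : STRAT_LINEAR;

    /* Once both are priced, keep the cheaper one; re-check periodically in
     * case the data distribution shifts. */
    else if (st->rows_seen == 2 * TRIAL_ROWS ||
             st->rows_seen % RECHECK_ROWS == 0)
        st->strategy = (st->avg_cycles[STRAT_LINEAR] <= st->avg_cycles[STRAT_HASH])
                       ? STRAT_LINEAR : STRAT_HASH;
}
```

The trial-then-compare structure is the whole idea: pay a bounded amount of measurement overhead up front, then ride the cheaper strategy while occasionally double-checking.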
The main risks associated with this patch are finding a way to measure the cost of the different strategies accurately and efficiently, and ensuring support across architectures: RDTSC might be viable on x86, but we need alternatives for platforms like arm64.
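To illustrate the portability angle, one could hide the per-architecture counter behind a small helper like the sketch below; the aarch64 read uses the generic virtual counter, and the fallback is plain `clock_gettime()`, so take this as an assumption-laden starting point rather than a vetted abstraction:

```c
#include <stdint.h>
#include <time.h>
#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>
#endif

/* Read a cheap monotonic tick counter, per architecture. */
static inline uint64_t read_ticks(void)
{
#if defined(__x86_64__) || defined(__i386__)
    return __rdtsc();                     /* x86 time-stamp counter */
#elif defined(__aarch64__)
    uint64_t val;
    __asm__ volatile("mrs %0, cntvct_el0" : "=r"(val));  /* ARM virtual counter */
    return val;
#else
    struct timespec ts;                   /* portable but slower fallback */
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t) ts.tv_sec * 1000000000u + ts.tv_nsec;
#endif
}
```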
This patch may not address performance issues directly, but it could provide a robustness improvement by adapting to changing data distributions and query patterns. I believe the changes should be relatively straightforward, but I'd love to hear your thoughts and feedback on this idea. Feel free to reach out to me directly, or discuss it with other Postgres developers in the pgsql-hackers community or our new Discord channel.