When Wildcard Search Goes Wild
Background:
At one of my previous companies, a product team experienced a surge in query timeouts while interacting with an RDS Postgres database. This issue arose shortly after a release, leading the team to reach out to the database group to investigate whether the recent changes were causing the problem.
Investigation:
Since the 26th, there has been a noticeable increase in CPU utilization. This spike aligns with the implementation of a GIN index to facilitate two way wildcard searches, enabling customers to perform highly flexible in-app contact searches. By doing so, customers would be able to search for contacts in-app with a great deal of flexibility.
Wildcard search performance with a GIN index and without a wildcard filter (normal search) was compared for multiple profiles with different data patterns.
Query plan for 10K row per profile account without gin index
Query plan for 10K row per profile account with gin index
Observations:
The query plan utilizing the GIN index is responsible for the highest logical I/O (LIO), with 66K shared buffer hits, mostly from GIN index access. Given the lengthy response times and high LIO, the CPU spikes are a direct consequence.
Every query includes a profile ID, ensuring that B-tree index access occurs regardless of GIN index usage. As a result, for small to medium-sized accounts, the GIN index offers little to no advantage.
The table also experiences frequent modifications, as indicated by the attached screenshot. In this case, the GIN index is likely causing more harm than good. A solid analysis of the negative effects of the GIN index can be found in the Gitlab Gin incident.
Quick Fix:
The CPU load dropped immediately after disabling the code path that relied on wildcard searches via the GIN index.
Further Problems:
DML on GIN tables was slowing down and that timeouts were occurring during update statements. GIN stands for “Generalized Inverted iNdex”. “Inverted” refers to the way that the index structure is set up, building a table-encompassing tree of all column values, where a single row can be represented in many places within the tree. By comparison, a B-tree index generally has one location where an index entry points to a specific row. When the table size increases, DML overhead for cases where you are modifying a commonly occurring value will increase significantly.
Long-term solution:
The GIN index was originally introduced to prevent timeouts for wildcard searches on accounts with large datasets (400K+ records). While the GIN index can enhance individual query performance in an OLAP context, it struggles to scale efficiently in an OLTP environment.
A more sustainable approach would be to trigger searches only after the user enters the first 4–5 characters, utilizing a trailing “%” in the LIKE operator. Full wildcard i.e. leading and trailing “%” prevents planner from utilizing normal B*Tree index, hence avoiding leading wildcard makes sense. Many websites implement this strategy with a single trailing wildcard (e.g., lower(email) LIKE 'test%'
). This would allow a "bitmap and" operation using standard B-tree indexes or even a composite index on relevant fields.
Ultimately, business input is crucial in deciding whether the performance costs of wildcard searches are justified, or if direct searches will suffice. If full searches by email or phone number meet business needs, simply expanding the existing B-tree index with additional columns should be sufficient. That would be good enough to handle all different types of profiles.
Conclusion:
Achieving optimal database performance requires more than just implementing cutting-edge technical solutions; it demands a deep understanding of the business needs driving those decisions. In this case, while implementing a GIN index initially seemed like a robust solution for handling complex searches for few very large profiles, it didn’t align well with the realities of an OLTP workload and the business’s typical usage patterns. The best-performing systems are those that seamlessly marry technical solutions with the actual needs of the business.