Point lookups are one of the three most common operations in Snowflake (the other two being analytical queries and transactions). In a point lookup, customers are typically looking for a handful of records, and they want to find them very quickly. As one can imagine, point lookups are very popular in interactive dashboards and data applications trying to conduct “needle in a haystack”-style queries.
Many analytical queries also need to have a fast filtering capability; for example, when fact tables are being joined with dimension tables, and when aggregation queries need heavy filtering. We tend to call these analytical queries searches. To speed up lookups and searches, Snowflake introduced the Search Optimization Service about a year ago, and today we are announcing several significant improvements that bring customers cost reductions, better observability, and a more powerful search.
Search Optimization Service cost reductions
Before looking at improvements, let’s review how the Search Optimization Service (SOS) works: it creates an optimized data structure, called the search access path, that, given a value, prunes the micro-partitions (MPs) that do not contain it. In our sample query that looks for the name “Martin” in a table, SOS will identify micro-partitions that do not contain that string, and instead of a full table scan, will scan only the remaining few partitions, saving a lot of time and resources.
In June of this year we announced an optimization that significantly reduced the cost of using the Search Optimization Service. Within the same month, more than 85% of our SOS customers saw a 50% or more reduction of their search optimization maintenance costs. We achieved this by improving the performance of the background warehouses where the search access paths were being maintained.
But we are not stopping there. Today we are launching the public preview of search optimization column configuration, enabling customers to choose which columns to include in the search access path.
You can continue relying on our default behavior to automatically maintain the search access path for a given table if you want to save time. Alternatively, you can explicitly select columns to more tightly control costs. For example, you could choose columns that you use for joins, and that you frequently filter on.
ALTER TABLE T ADD SEARCH OPTIMIZATION ON EQUALITY(C1,C2,C3);
Better observability of costs and benefits
In the past, customers were able to use the ESTIMATE_SEARCH_OPTIMIZATION_COSTS function to decide whether to use SOS in a table. The output of the function provides estimated build costs, storage costs, and projected monthly maintenance costs. In June we improved this function to be sampling-based, yielding even more precise estimations.
The new search optimization benefits view, currently in private preview, allows users to track the number of micro-partitions that were pruned by SO and did not have to be processed, saving users query costs and reducing query duration. These stats are available on a per-table basis and can be used to determine if SOS is cost effective for customers.
New, powerful search optimizations
Our initial set of search optimizations that we launched last year reduced the latency of queries that filtered data by using equality predicates on Numeric, String, Binary, and Date & Time data types.
Today we are announcing public previews of search optimizations in VARIANT, ARRAY, OBJECT, and GEOGRAPHY data types, and pattern and substring searches in String columns.
The addition of search capabilities in semi-structured data (VARIANT, ARRAY, and OBJECT data types) is a significant new capability as more and more users store logs and metrics in Snowflake. Users can add entire variant columns or subpaths to the search access path and Snowflake will index all leaf fields, automatically detecting data types. This new capability allows users, for example, to search for IP addresses in Variant fields and Arrays, and do so quickly. Our initial preview customers reported very significant improvements in latency (usually 2-3x reductions). One customer, a large financial services company, experienced up to 20x latency reductions (from 20 seconds to 1 second).
Example:
Searching for an IP address in a VARIANT column WHERE location:sender_info.ip_address=‘123.123.123.123’; Example: Searching for an IP address within an ARRAY WHERE ARRAY_CONTAINS(‘123.123.123.123’::variant, logs:ip_addresses);
Faster search in GEOGRAPHY columns storing shapes such as polygons, lines, and point collections, has substantially improved the interactivity of map display by partners such as CARTO. Co-founder of CARTO Javier de la Torre said, “Search optimization for Geospatial allowed CARTO to create map tiles dynamically 5 to 10 times faster than before.”
To enable fast geo searches, we added a new geospatial search method.
ALTER TABLE T ADD SEARCH OPTIMIZATION ON GEO(G);
With this method, tools that visualize shapes on maps and conduct searches in Snowflake’s GEOGRAPHY columns using Intersects, Contains, and other predicates, can achieve a much higher level of interactivity than before.
Example: Comparison of map refresh speed without (left) and with (right) search optimization
Searches in String columns such as VARCHAR were also improved. We are announcing today that searches of simple string patterns using LIKE and ILIKE, and more complex regular expressions using RLIKE are now much faster. To enable these improvements, we added a new search method “SUBSTRING” to the Search Optimization Service.
ALTER TABLE T ADD SEARCH OPTIMIZATION ON SUBSTRING(C4);
We also improved the performance of substring operations such as CONTAINS, SPLIT, SPLIT_PART, and SUBSTRING. These improvements will work even if you do not have SOS enabled in a table.
In summary, customers are now able to benefit from cost reductions through continued improvement of search access path maintenance efficiency and a new, flexible syntax for selecting a subset of columns for SO (in public preview). They gain better observability via a sampling-based cost estimation function (generally available) and a new SO benefits view (in private preview). And, they get a more powerful and faster search in Variant, Array, Object data types, in the Geography data type, and wildcard/pattern searches in String data types (all three in public preview).
Snowflake customers are asking us for faster and more flexible searches and we have big plans to deliver more search innovation in the future. Watch this space for more announcements.
Originally posted on October 22, 2022 @ 2:30 am