Druid v0.19.0 Release Notes

Release Date: 2020-07-21
    Apache Druid 0.19.0 contains around 200 new features, bug fixes, performance enhancements, documentation improvements, and additional test coverage from 51 contributors. Refer to the complete list of changes and everything tagged to the milestone for further details.

    # New Features

    0๏ธโƒฃ # GroupBy and Timeseries vectorized query engines enabled by default

    Vectorized query engines for GroupBy and Timeseries queries were introduced in Druid 0.16 as an opt-in feature. Since then, we have extensively tested these engines and feel that the time has come for these improvements to find a wider audience. Note that not all of the query engine is vectorized at this time, but with this change any query that is eligible for vectorization will be vectorized. If you encounter any problems, this feature may still be disabled by setting druid.query.vectorize to false.

    #10065
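As an illustration, vectorization can be controlled per query through the 'vectorize' query context key. The datasource, interval, and aggregation below are hypothetical:

```json
{
  "queryType": "timeseries",
  "dataSource": "wikipedia",
  "intervals": ["2020-01-01/2020-01-02"],
  "granularity": "hour",
  "aggregations": [{ "type": "count", "name": "rows" }],
  "context": { "vectorize": "false" }
}
```

In addition to "true" and "false", the context value "force" can be used to make queries fail if they cannot be vectorized, which is useful for testing.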

    ๐Ÿ‘ # Druid native batch support for Apache Avro Object Container Files

    New in Druid 0.19.0, native batch indexing now supports Apache Avro Object Container Format encoded files, allowing batch ingestion of Avro data without needing an external Hadoop cluster. Check out the docs for more details.

    #9671
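A minimal native batch ioConfig using the new Avro OCF input format might look like the sketch below (the local input source and paths are illustrative; see the ingestion docs for the full spec):

```json
"ioConfig": {
  "type": "index_parallel",
  "inputSource": {
    "type": "local",
    "baseDir": "/data/events",
    "filter": "*.avro"
  },
  "inputFormat": {
    "type": "avro_ocf"
  }
}
```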

    โšก๏ธ # Updated Druid native batch support for SQL databases

    An 'SqlInputSource' has been added in Druid 0.19.0 to work with the new native batch ingestion specifications first introduced in Druid 0.17, deprecating the SqlFirehose. Like the 'SqlFirehose', it currently supports MySQL and PostgreSQL, using the drivers from those extensions. This is a relatively low-level ingestion task, and the operator must take care to manually ensure that the correct data is ingested, either by carefully crafting queries so that no duplicate data is ingested for appends, or by ensuring that the entire set of data is queried when overwriting. See the docs for more operational details.

    #9449
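For example, a sketch of an 'sql' input source reading from MySQL (the connection details, table, and query are hypothetical; as noted above, the operator is responsible for writing queries that avoid duplicate or partial data):

```json
"inputSource": {
  "type": "sql",
  "database": {
    "type": "mysql",
    "connectorConfig": {
      "connectURI": "jdbc:mysql://db.example.com:3306/events",
      "user": "druid",
      "password": "secret"
    }
  },
  "sqls": [
    "SELECT ts, page, added FROM edits WHERE ts >= '2020-01-01' AND ts < '2020-01-02'"
  ]
}
```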

    # Apache Ranger based authorization

    A new extension in Druid 0.19.0 adds an Authorizer which implements access control for Druid, backed by Apache Ranger. Please see [the extension documentation](https://druid.apache.org/docs/0.19.0/development/extensions-core/druid-ranger-security.html) and Authentication and Authorization for more information on the basic facilities this extension provides.

    #9579

    ๐Ÿ‘ # Alibaba Object Storage Service support

    A new 'contrib' extension has been added for Alibaba Cloud Object Storage Service (OSS), providing both deep storage and a batch ingestion input source. Since this is a 'contrib' extension, it will not be packaged by default in the binary distribution; please see community extensions for details on how to use it in your cluster.

    #9898

    # Ingestion worker autoscaling for Google Compute Engine

    Another new 'contrib' extension in 0.19.0 supports ingestion worker autoscaling for Google Compute Engine, allowing a Druid Overlord to provision or terminate worker instances (MiddleManagers or Indexers) whenever there are pending tasks or idle workers. Unlike the Amazon Web Services ingestion autoscaling extension, which provisions and terminates instances directly without using an Auto Scaling Group, the GCE autoscaler uses Managed Instance Groups to more closely align with how operators are likely to provision their clusters in GCE. Like other 'contrib' extensions, it will not be packaged by default in the binary distribution; please see community extensions for details on how to use it in your cluster.

    #8987

    # REGEXP_LIKE

    A new REGEXP_LIKE function has been added to Druid SQL and native expressions, which behaves similarly to LIKE, except that it uses regular expressions for the pattern.

    #9893
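For example (the datasource and column names are illustrative):

```sql
SELECT page, COUNT(*) AS edits
FROM wikipedia
WHERE REGEXP_LIKE("comment", '^Bot')
GROUP BY page
```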

    ๐ŸŒ # Web console lookup management improvements

    ๐ŸŒ Druid 0.19 also web console also includes some useful improvements to the lookup table management interface. Creating and editing lookups is now done with a form to accept user input, rather than a raw text editor to enter the JSON spec.

    Additionally, clicking the magnifying glass icon next to a lookup now displays the first 5000 values of that lookup.

    #9549
    #9587

    # New Coordinator per datasource 'loadstatus' API

    A new Coordinator API makes it easier to determine if the latest published segments are available for querying. It is similar to the existing Coordinator 'loadstatus' API, but is datasource specific, may specify an interval, and can optionally refresh the metadata store snapshot to get the latest up-to-date information. Note that operators should still exercise caution when using this API to query large numbers of segments, especially if forcing a metadata refresh, as it can be a 'heavy' call on large clusters.

    #9965
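Assuming the shape described in the pull request, a call might look like the following (the host, datasource, and interval are placeholders; the interval is URL-encoded):

```shell
curl "http://coordinator:8081/druid/coordinator/v1/datasources/wikipedia/loadstatus?forceMetadataRefresh=false&interval=2020-01-01%2F2020-02-01"
```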

    ๐Ÿ‘ # Native batch append support for range and hash partitioning

    โฌ‡๏ธ Part bug fix, part new feature, Druid native batch (once again) supports appending new data to existing time chunks when those time chunks were partitioned with 'hash' or 'range' partitioning algorithms. Note that currently the appended segments only support 'dynamic' partitioning, and when rolling back to older versions that these appended segments will not be recognized by Druid after the downgrade. In order to roll back to a previous version, these appended segments should be compacted with the rest of the time chunk in order to have a homogenous partitioning scheme.

    #10033

    # Bug fixes

    Druid 0.19.0 contains 65 bug fixes; you can see the complete list here.

    # Fix for batch ingested 'dynamic' partitioned segments not becoming queryable atomically

    Druid 0.19.0 fixes an important query correctness issue: 'dynamic' partitioned segments produced by a batch ingestion task were not tracking the overall number of partitions. As a result, when these segments came online they did not do so as a complete set, but rather as individual segments, meaning there were periods of swapping during which results could be queried from an incomplete partition set within a time chunk.

    #10025

    # Fix to allow 'hash' and 'range' partitioned segments with empty buckets to be queryable

    Prior to 0.19.0, Druid had a bug when using 'hash' or 'range' partitioning: if data skew was such that any of the buckets were empty after ingestion, the partitions would never be recognized as complete and so would never become queryable. Druid 0.19.0 fixes this issue by adjusting the schema of the partitioning spec. These changes to the JSON format should be backwards compatible; however, rolling back to a previous version will again make these segments unqueryable.

    #10012

    # Incorrect balancer behavior

    A bug in Druid versions prior to 0.19.0 allowed for incorrect Coordinator operation in the event that druid.server.maxSize was not set. If all Historicals lacked this value, the bug would allow segments to load and effectively balance them randomly across the cluster, regardless of which balancer strategy was actually configured. This bug has been fixed, but as a result druid.server.maxSize must now be set to the sum of the segment cache location sizes for Historicals, or else they will not load segments.

    #10070

    โฌ†๏ธ # Upgrading to Druid 0.19.0

    Please be aware of the following issues when upgrading from 0.18.1 to 0.19.0. If you are upgrading from a version earlier than 0.18.1, please see the release notes of the relevant intermediate versions.

    # 'druid.server.maxSize' must now be set for Historical servers

    โฌ†๏ธ A Coordinator bug fix as a side-effect now requires druid.server.maxSize to be set for segments to be loaded. While this value should have been set correctly for previous versions, please be sure this value is configured correctly before upgrading your clusters or else segments will not be loaded.

    #10070
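For example, a Historical with a single segment cache location would set druid.server.maxSize to the same size as that location (the path and sizes below are illustrative):

```properties
druid.segmentCache.locations=[{"path":"/var/druid/segment-cache","maxSize":300000000000}]
druid.server.maxSize=300000000000
```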

    # System tables 'sys.segments' column 'payload' has been removed and replaced with 'dimensions', 'metrics', and 'shardSpec'

    The removal of the 'payload' column from the sys.segments table should make queries on this table much more efficient. The most useful fields from that column, the list of 'dimensions', 'metrics', and the 'shardSpec', have been split out into their own columns and so remain available to queries.

    #9883

    0๏ธโƒฃ # Changed default number of segment loading threads

    ๐ŸŽ The druid.segmentCache.numLoadingThreads configuration has had the default value changed from 'number of cores' to 'number of cores' divided by 6. This should make historicals a bit more well behaved out of the box when loading a large number of segments, limiting the impact on query performance.

    #9856

    # Broadcast load rules no longer have 'colocated datasources'

    A number of incomplete changes to facilitate more efficient join queries, based on the idea of using broadcast load rules to propagate smaller datasources throughout the cluster so that join operations can be pushed down to individual segment processing, have been added to 0.19.0. While not yet a finished feature, as part of these changes 'broadcast' load rules no longer have the concept of 'colocated datasources', which attempted to broadcast segments only to servers that had segments of the configured datasource. This did not work well in practice because it was non-atomic: the broadcast segments would lag behind loads and drops of the colocated datasource, so we decided to remove it.

    #9971

    # Brokers and realtime tasks may now be configured to load segments from 'broadcast' datasources

    As another effect of the aforementioned preliminary work to introduce efficient 'broadcast joins', Brokers and realtime indexing tasks will now load segments assigned by 'broadcast' rules if a segment cache is configured. Since the feature is not complete, there is little reason to do this in 0.19.0, and it will not happen unless explicitly configured.

    #9971

    # lpad and rpad function behavior change

    0๏ธโƒฃ The lpad and rpad functions have gone through a slight behavior change in Druids default non-SQL compatible mode, in order to make them behave consistently with PostgreSQL. In the new behavior, if the pad expression is an empty string, then the result will be the (possibly trimmed) original characters, rather than the empty string being treated as a null and coercing the results to null.

    #10006
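The new behavior can be illustrated with a small Python model of the non-SQL-compatible mode semantics. This is an illustrative sketch, not Druid's implementation; `druid_lpad` is a hypothetical helper, and it assumes a non-negative length:

```python
def druid_lpad(s, length, pad):
    """Model of lpad in Druid's default (non-SQL-compatible) mode, 0.19.0+."""
    if s is None:
        return None
    if pad == "" or len(s) >= length:
        # Empty pad (new in 0.19.0) or input already long enough:
        # return the input trimmed to `length`, matching PostgreSQL,
        # instead of coercing the result to null (pre-0.19 behavior).
        return s[:length]
    need = length - len(s)
    return (pad * need)[:need] + s

print(druid_lpad("abc", 5, "x"))  # xxabc
print(druid_lpad("abc", 5, ""))   # abc   (was null before 0.19.0)
```

The same reasoning applies symmetrically to rpad, with padding appended rather than prepended.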

    # Extensions providing custom Druid expressions are now expected to implement equals and hashCode methods

    A change to the Expr interface in Druid 0.19.0 requires that any extension providing custom expressions via ExprMacroTable also implement equals and hashCode methods to function correctly, especially with JOIN queries, which rely on filter and expression analysis to determine how to optimally process a query.

    #9830

    # Known Issues

    For a full list of open issues, please see https://github.com/apache/druid/labels/Bug.

    # Credits

    Thanks to everyone who contributed to this release!

    @2bethere
    @a-chumagin
    @a2l007
    @abhishekrb19
    @agricenko
    @ahuret
    @alex-plekhanov
    @AlexanderSaydakov
    @awelsh93
    @bolkedebruin
    @calvinhkf
    @capistrant
    @ccaominh
    @chenyuzhi459
    @clintropolis
    @damnMeddlingKid
    @danc
    @dylwylie
    @egor-ryashin
    @FrankChen021
    @frnidito
    @Fullstop000
    @gianm
    @harshpreet93
    @jihoonson
    @jon-wei
    @josephglanville
    @kamaci
    @kanibs
    @leerho
    @liujianhuanzz
    @maytasm
    @mcbrewster
    @mghosh4
    @morrifeldman
    @pjain1
    @samarthjain
    @stefanbirkner
    @sthetland
    @suneet-s
    @surekhasaharan
    @tarpdalton
    @viongpanzi
    @vogievetsky
    @willsalz
    @wjhypo
    @xhl0726
    @xiangqiao123
    @xvrl
    @yuanlihan
    @zachjsh