📚 Apache Druid 0.19.0 contains around 200 new features, bug fixes, performance enhancements, documentation improvements, and additional test coverage from 51 contributors. Refer to the complete list of changes and everything tagged to the milestone for further details.
# New Features
0️⃣ # GroupBy and Timeseries vectorized query engines enabled by default
✅ Vectorized query engines for GroupBy and Timeseries queries were introduced in Druid 0.16 as an opt-in feature. Since then we have extensively tested these engines and feel that the time has come for these improvements to find a wider audience. Note that not all of the query engine is vectorized at this time, but with this change any query that is eligible for vectorization will be vectorized. If you encounter any problems, this feature can still be disabled.
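As a minimal sketch of opting a single query out of vectorization, the snippet below builds a native Timeseries query whose context sets 'vectorize' to 'false'. It assumes the 'vectorize' query context parameter described in the Druid query documentation; the datasource name, interval, granularity, and aggregator are hypothetical placeholders.

```python
import json


def timeseries_query(datasource, interval, vectorize="false"):
    """Build a native Timeseries query with an explicit 'vectorize' context flag."""
    return {
        "queryType": "timeseries",
        "dataSource": datasource,        # hypothetical datasource name
        "intervals": [interval],
        "granularity": "hour",
        "aggregations": [{"type": "count", "name": "rows"}],
        # Assumption: 'vectorize' is the query context key that controls
        # vectorization; "false" turns it off for this query only.
        "context": {"vectorize": vectorize},
    }


query = timeseries_query("wikipedia", "2020-07-01/2020-07-02")
print(json.dumps(query, indent=2))
```

Setting the flag per query keeps the new default in place for everything else, which makes it easier to isolate a problem to a single query shape.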
👍 # Druid native batch support for Apache Avro Object Container Files
🆕 New in Druid 0.19.0, native batch indexing now supports Apache Avro Object Container Format encoded files, allowing batch ingestion of Avro data without needing an external Hadoop cluster. Check out the docs for more details.
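A rough sketch of what the 'ioConfig' of such a native batch task might look like is shown below. It assumes the 'avro_ocf' input format type provided by the Avro extension and a 'local' input source; the directory and file filter are hypothetical, and the rest of the ingestion spec is omitted.

```python
import json


def avro_ocf_io_config(base_dir, file_filter="*.avro"):
    """Build an 'index_parallel' ioConfig that reads Avro OCF files from local disk."""
    return {
        "type": "index_parallel",
        # Hypothetical location of the Avro Object Container Files.
        "inputSource": {"type": "local", "baseDir": base_dir, "filter": file_filter},
        # Assumption: 'avro_ocf' is the input format type for Avro OCF data.
        "inputFormat": {"type": "avro_ocf"},
    }


io_config = avro_ocf_io_config("/data/avro")
print(json.dumps(io_config, indent=2))
```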
⚡️ # Updated Druid native batch support for SQL databases
👀 An 'SqlInputSource' has been added in Druid 0.19.0 to work with the new native batch ingestion specifications first introduced in Druid 0.17, deprecating the SqlFirehose. Like the 'SqlFirehose' it currently supports MySQL and PostgreSQL, using the driver from those extensions. This is a relatively low level ingestion task, and the operator must take care to manually ensure that the correct data is ingested, either by specially crafting queries to ensure no duplicate data is ingested for appends, or ensuring that the entire set of data is queried to be replaced when overwriting. See the docs for more operational details.
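One way to avoid ingesting duplicate rows on append is to give each query a half-open time window so that no row can match two queries. The sketch below generates such queries and wraps them in an input source definition. The field names ('database', 'connectorConfig', 'connectURI', 'sqls') follow the SQL input source documentation as best understood here and should be checked against the docs; the table, column, and connection details are hypothetical.

```python
import json
from datetime import datetime, timedelta


def hourly_window_queries(table, ts_column, start, end):
    """Return one SELECT per half-open hour [lo, hi) between start and end."""
    queries = []
    lo = start
    while lo < end:
        hi = min(lo + timedelta(hours=1), end)
        queries.append(
            f"SELECT * FROM {table} "
            f"WHERE {ts_column} >= '{lo:%Y-%m-%d %H:%M:%S}' "
            f"AND {ts_column} < '{hi:%Y-%m-%d %H:%M:%S}'"
        )
        lo = hi
    return queries


def sql_input_source(connect_uri, user, password, sqls):
    """Wrap the queries in a 'sql' input source (MySQL assumed for illustration)."""
    return {
        "type": "sql",
        "database": {
            "type": "mysql",
            "connectorConfig": {
                "connectURI": connect_uri,
                "user": user,
                "password": password,
            },
        },
        "sqls": sqls,
    }


sqls = hourly_window_queries(
    "events", "created_at", datetime(2020, 7, 1, 0), datetime(2020, 7, 1, 3)
)
input_source = sql_input_source("jdbc:mysql://db.example.com:3306/app", "druid", "secret", sqls)
print(json.dumps(input_source, indent=2))
```

Because each window excludes its upper bound, a row stamped exactly on the hour is read by only one query.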
# Apache Ranger based authorization
📚 A new extension in Druid 0.19.0 adds an Authorizer which implements access control for Druid, backed by Apache Ranger. Please see [the extension documentation](https://druid.apache.org/docs/0.19.0/development/extensions-core/druid-ranger-security.html) and Authentication and Authorization for more information on the basic facilities this extension provides.
👍 # Alibaba Object Storage Service support
👀 A new 'contrib' extension has been added for Alibaba Cloud Object Storage Service (OSS), providing both deep storage and a batch ingestion input source. Since this is a 'contrib' extension, it will not be packaged by default in the binary distribution. Please see community extensions for more details on how to use it in your cluster.
👷 # Ingestion worker autoscaling for Google Compute Engine
Another 'contrib' extension new in 0.19.0 adds ingestion worker autoscaling for Google Compute Engine, which allows a Druid Overlord to provision or terminate worker instances (MiddleManagers or Indexers) whenever there are pending tasks or idle workers. Unlike the Amazon Web Services ingestion autoscaling extension, which provisions and terminates instances directly without using an Auto Scaling Group, the GCE autoscaler uses Managed Instance Groups to more closely align with how operators are likely to provision their clusters in GCE. Like other 'contrib' extensions, it will not be packaged by default in the binary distribution. Please see community extensions for more details on how to use it in your cluster.
# REGEXP_LIKE
A new 'REGEXP_LIKE' function has been added to Druid SQL and native expressions, which behaves similarly to 'LIKE', except using regular expressions for the pattern.
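To illustrate the difference in matching style, the sketch below emulates both functions in Python. It is not Druid code: Druid evaluates Java regular expressions, so Python's 're' module is only an approximation for simple patterns, and treating 'REGEXP_LIKE' as an unanchored search (a match anywhere in the value) is an assumption to verify against the Druid SQL docs.

```python
import re


def like(value, pattern):
    """Emulate SQL LIKE: '%' matches any run of characters, '_' exactly one."""
    regex = "".join(
        ".*" if ch == "%" else "." if ch == "_" else re.escape(ch)
        for ch in pattern
    )
    return re.fullmatch(regex, value, flags=re.DOTALL) is not None


def regexp_like(value, pattern):
    """Emulate REGEXP_LIKE as an unanchored regular-expression search (assumed)."""
    return re.search(pattern, value) is not None


print(like("druid-0.19.0", "druid-%"))
print(regexp_like("druid-0.19.0", r"^druid-\d+\.\d+\.\d+$"))
```

'LIKE' can only express "any characters" wildcards, while the regular expression can insist on the three dotted numeric components of a version string.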
🌐 # Web console lookup management improvements
🌐 The Druid 0.19 web console also includes some useful improvements to the lookup table management interface. Creating and editing lookups is now done with a form to accept user input, rather than a raw text editor to enter the JSON spec.
➕ Additionally, clicking the magnifying glass icon next to a lookup will now allow displaying the first 5000 values of that lookup.
# New Coordinator per datasource 'loadstatus' API
📇 A new Coordinator API makes it easier to determine if the latest published segments are available for querying. It is similar to the existing Coordinator 'loadstatus' API, but is datasource specific, may specify an interval, and can optionally force a live refresh of the metadata store snapshot to get the most up-to-date information. Note that operators should still exercise caution when using this API to query large numbers of segments, especially if forcing a metadata refresh, as it can potentially be a 'heavy' call on large clusters.
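The sketch below only builds the request URL and does not contact a cluster. The path and the 'forceMetadataRefresh' and 'interval' parameter names are assumptions based on the Coordinator API reference, as is the convention of writing the interval with '_' in place of '/'; the host and datasource are hypothetical.

```python
from urllib.parse import quote, urlencode


def datasource_loadstatus_url(coordinator, datasource, force_refresh=False, interval=None):
    """Build the per-datasource 'loadstatus' URL (path and parameter names assumed)."""
    # Leaving the refresh off by default avoids the 'heavy' metadata store poll.
    params = {"forceMetadataRefresh": str(force_refresh).lower()}
    if interval is not None:
        # Assumption: the interval is passed with '_' instead of '/'.
        params["interval"] = interval.replace("/", "_")
    path = f"/druid/coordinator/v1/datasources/{quote(datasource, safe='')}/loadstatus"
    return f"{coordinator.rstrip('/')}{path}?{urlencode(params)}"


url = datasource_loadstatus_url(
    "http://localhost:8081", "wikipedia", interval="2020-07-01/2020-07-02"
)
print(url)
```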
👍 # Native batch append support for range and hash partitioning
⬇️ Part bug fix, part new feature, Druid native batch (once again) supports appending new data to existing time chunks when those time chunks were partitioned with the 'hash' or 'range' partitioning algorithms. Note that the appended segments currently only support 'dynamic' partitioning, and that after a downgrade to an older version these appended segments will not be recognized by Druid. In order to roll back to a previous version, these appended segments should first be compacted with the rest of the time chunk so that it has a homogeneous partitioning scheme.
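A minimal sketch of the relevant spec settings, plus a small pre-submit check, might look like the following. It assumes the 'appendToExisting' flag in 'ioConfig' and a 'partitionsSpec' of type 'dynamic' in 'tuningConfig' for 'index_parallel' tasks; the helper names are hypothetical and the rest of the spec is omitted.

```python
def append_spec_fragment():
    """Fragment of an 'index_parallel' spec that appends with 'dynamic' partitioning."""
    return {
        "ioConfig": {"type": "index_parallel", "appendToExisting": True},
        "tuningConfig": {
            "type": "index_parallel",
            "partitionsSpec": {"type": "dynamic"},
        },
    }


def is_valid_append(spec):
    """Return True unless the spec appends with a non-'dynamic' partitionsSpec."""
    appending = spec.get("ioConfig", {}).get("appendToExisting", False)
    # Assumption: an omitted partitionsSpec falls back to 'dynamic'.
    partition_type = (
        spec.get("tuningConfig", {}).get("partitionsSpec", {}).get("type", "dynamic")
    )
    return (not appending) or partition_type == "dynamic"


print(is_valid_append(append_spec_fragment()))
```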
🛠 # Bug fixes
👀 Druid 0.19.0 contains 65 bug fixes; you can see the complete list here.
# Fix for batch ingested 'dynamic' partitioned segments not becoming queryable atomically
🛠 Druid 0.19.0 fixes an important query correctness issue, where 'dynamic' partitioned segments produced by a batch ingestion task were not tracking the overall number of partitions. This had the implication that when these segments came online, they did not do so as a complete set, but rather as individual segments, meaning that there would be periods of swapping where results could be queried from an incomplete partition set within a time chunk.
# Fix to allow 'hash' and 'range' partitioned segments with empty buckets to now be queryable
🛠 Prior to 0.19.0, Druid had a bug when using 'hash' or 'range' partitioning: if data skew left any of the buckets 'empty' after ingestion, the partitions would never be recognized as 'complete' and so would never become queryable. Druid 0.19.0 fixes this issue by adjusting the schema of the partitioning spec. These changes to the JSON format should be backwards compatible; however, rolling back to a previous version will again make these segments no longer queryable.
# Incorrect balancer behavior
🔧 A bug in Druid versions prior to 0.19.0 allowed for (incorrect) coordinator operation in the event 'druid.server.maxSize' was not set. This bug would allow segments to load, and effectively randomly balance them in the cluster (regardless of what balancer strategy was actually configured) if all historicals did not have this value set. This bug has been fixed, but as a result 'druid.server.maxSize' must be set to the sum of the segment cache location sizes for historicals, or else they will not load segments.
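As a worked example of that sum, the sketch below adds up the 'maxSize' of each segment cache location and prints the resulting property line. The two locations and their sizes are hypothetical, and the list is assumed to mirror the JSON value of 'druid.segmentCache.locations'.

```python
def required_server_max_size(locations):
    """Sum the per-location 'maxSize' values, in bytes."""
    return sum(location["maxSize"] for location in locations)


# Hypothetical value of druid.segmentCache.locations on one Historical.
locations = [
    {"path": "/mnt/disk1/segment-cache", "maxSize": 300_000_000_000},
    {"path": "/mnt/disk2/segment-cache", "maxSize": 300_000_000_000},
]

max_size = required_server_max_size(locations)
print(f"druid.server.maxSize={max_size}")
```

With two 300 GB locations, the Historical would be configured with a 'druid.server.maxSize' of 600000000000 bytes.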
⬆️ # Upgrading to Druid 0.19.0
🚀 Please be aware of the following issues when upgrading from 0.18.1 to 0.19.0. If you're updating from an earlier version than 0.18.1, please see the release notes of the relevant intermediate versions.
# 'druid.server.maxSize' must now be set for Historical servers
⬆️ As a side effect of a Coordinator bug fix, 'druid.server.maxSize' must now be set for segments to be loaded. While this value should have been set correctly for previous versions, please be sure it is configured correctly before upgrading your clusters, or else segments will not be loaded.
🚚 # System tables 'sys.segments' column 'payload' has been removed and replaced with 'dimensions', 'metrics', and 'shardSpec'
🛰 The removal of the 'payload' column from the 'sys.segments' table should make queries on this table much more efficient. The most useful fields from it, the list of 'dimensions', 'metrics', and the 'shardSpec', have been split out into their own columns and so are still available.
0️⃣ # Changed default number of segment loading threads
The 'druid.segmentCache.numLoadingThreads' configuration has had its default value changed from 'number of cores' to 'number of cores' divided by 6. This should make historicals a bit better behaved out of the box when loading a large number of segments, limiting the impact on query performance.
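A quick way to see what the new default means for a given machine is sketched below. The integer division and the floor of one thread are assumptions about how the default is rounded, not something stated above, so treat the results as approximate.

```python
def default_num_loading_threads(cores):
    """Approximate the new default: cores divided by 6, assumed to be at least 1."""
    return max(1, cores // 6)


for cores in (4, 16, 48):
    print(cores, "cores ->", default_num_loading_threads(cores), "loading threads")
```

Operators who relied on the old behavior can still set 'druid.segmentCache.numLoadingThreads' explicitly.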
# Broadcast load rules no longer have 'colocated datasources'
🔧 A number of incomplete changes to facilitate more efficient join queries, based on the idea of utilizing broadcast load rules to propagate smaller datasources among the cluster so that join operations can be pushed down to individual segment processing, have been added to 0.19.0. While not a finished feature yet, as part of the changes to make this happen, 'broadcast' load rules no longer have the concept of 'colocated datasources', which would attempt to only broadcast segments to servers that had segments of the configured datasource. This didn't work so well in practice, as it was non-atomic, meaning that the broadcast segments would lag behind loads and drops of the colocated datasource, so we decided to remove it.
🔧 # Brokers and realtime tasks may now be configured to load segments from 'broadcast' datasources
🔧 As another effect of the aforementioned preliminary work to introduce efficient 'broadcast joins', Brokers and realtime indexing tasks will now load segments loaded by 'broadcast' rules, if a segment cache is configured. Since the feature is not complete, there is little reason to do this in 0.19.0, and it will not happen unless explicitly configured.
# lpad and rpad function behavior change
0️⃣ The lpad and rpad functions have gone through a slight behavior change in Druid's default non-SQL compatible mode, in order to make them behave consistently with PostgreSQL. In the new behavior, if the pad expression is an empty string, then the result will be the (possibly trimmed) original characters, rather than the empty string being treated as a null and coercing the results to null.
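The new behavior can be approximated with the Python emulation below. It is a sketch of PostgreSQL-style padding semantics rather than Druid's implementation, and it assumes that values longer than the target length are truncated on the right for both functions.

```python
def _fill(pad, needed):
    """Repeat 'pad' and cut it to exactly 'needed' characters."""
    return (pad * (needed // len(pad) + 1))[:needed]


def lpad(value, length, pad):
    """Left-pad 'value' to 'length'; an empty pad yields the (possibly trimmed) value."""
    if length <= len(value) or not pad:
        return value[:max(length, 0)]
    return _fill(pad, length - len(value)) + value


def rpad(value, length, pad):
    """Right-pad 'value' to 'length'; an empty pad yields the (possibly trimmed) value."""
    if length <= len(value) or not pad:
        return value[:max(length, 0)]
    return value + _fill(pad, length - len(value))


print(repr(lpad("abc", 5, "")))     # empty pad: original characters, not null
print(repr(lpad("abcdef", 3, "")))  # empty pad with a shorter target: trimmed
print(repr(rpad("abc", 5, "xy")))
```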
# Extensions providing custom Druid expressions are now expected to implement equals and hashCode methods
A change to the 'Expr' interface in Druid 0.19.0 requires that any extension which provides custom expressions via 'ExprMacroTable' must also implement 'equals' and 'hashCode' methods to function correctly, especially with JOIN queries, which rely on filter and expression analysis for determining how to optimally process a query.
# Known Issues
👀 For a full list of open issues, please see https://github.com/apache/druid/labels/Bug.
🚀 Thanks to everyone who contributed to this release!