Weaviate v0.22.15 Release Notes

Release Date: 2020-08-28

  • 🐳 Docker image/tag: semitechnologies/weaviate:0.22.15
    👀 See also: example docker compose files in English, German, Dutch, Italian and Czech.

    💥 Breaking Changes

    none

    🆕 New Features

    Optional Compound Splitting in Contextionary

    Motivation

    Sometimes Weaviate's Contextionary does not understand words that are compounded out of words it would otherwise understand. The impact is far greater in languages that allow for arbitrary compounding (such as Dutch or German) than in languages where compounding is not very common (such as English).

    Effect

    Imagine you import an object of class Post with the content "This is a thunderstormcloud". The arbitrarily compounded word thunderstormcloud is not present in the Contextionary, so your object's position will be made up of only the words it recognizes: "post" and "this" ("is" and "a" are removed as stopwords).

    If you check how this content was vectorized using the _interpretation feature, you will see something like the following:

    "_interpretation": {
      "source": [
        { "concept": "post", "occurrence": 62064610, "weight": 0.3623903691768646 },
        { "concept": "this", "occurrence": 932425699, "weight": 0.10000000149011612 }
      ]
    }
    

    To overcome this limitation, the optional Compound Splitting feature can be enabled in the Contextionary. It will understand the arbitrarily compounded word and interpret your object as follows:

    "_interpretation": {
      "source": [
        { "concept": "post", "occurrence": 62064610, "weight": 0.3623903691768646 },
        { "concept": "this", "occurrence": 932425699, "weight": 0.10000000149011612 },
        { "concept": "thunderstormcloud (thunderstorm, cloud)", "occurrence": 5756775, "weight": 0.5926488041877747 }
      ]
    }
    

    Note that the newly found word (made up of the parts thunderstorm and cloud) has the highest weight in the vectorization. The meaning that would have been lost without Compound Splitting is now recognized.
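To make the idea concrete, here is a minimal sketch of dictionary-based compound splitting. It is an illustration only, not the Contextionary's actual algorithm, which these notes do not describe. The vocabulary `KNOWN_WORDS` and the function `split_compound` are hypothetical names invented for this example.

```python
from typing import List, Optional, Set

# Hypothetical stand-in for the words the Contextionary already knows.
KNOWN_WORDS: Set[str] = {"post", "this", "thunder", "storm", "thunderstorm", "cloud"}


def split_compound(word: str, vocabulary: Set[str]) -> Optional[List[str]]:
    """Split `word` into known parts, preferring the longest leading part.

    Returns None if the word cannot be fully covered by known words.
    """
    if not word:
        return []
    # Try the longest prefix first so "thunderstorm" wins over "thunder".
    for end in range(len(word), 0, -1):
        prefix = word[:end]
        if prefix in vocabulary:
            rest = split_compound(word[end:], vocabulary)
            if rest is not None:
                return [prefix] + rest
    return None


print(split_compound("thunderstormcloud", KNOWN_WORDS))  # ['thunderstorm', 'cloud']
print(split_compound("xyzzy", KNOWN_WORDS))              # None
```

A search like this only has to run for words that are not found directly, but it does add work for each such word.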

    Trade-Off: Import Speed vs. Word Recognition

    Compound Splitting runs on any word that is not otherwise recognized. Depending on your dataset, this can lead to a significantly longer import time (up to 100% longer). Therefore, you should carefully evaluate whether higher precision in recognition or faster import times are more important to your use case. As the benefit is larger in some languages (e.g. Dutch, German) than in others (e.g. English), this feature is turned off by default.

    How to activate

    To turn on compound splitting, set the environment variable ENABLE_COMPOUND_SPLITTING to true on the contextionary container. For example, in the English-language docker-compose files, the variable can be found in this line.
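As a rough sketch of what this change amounts to, the snippet below patches a docker-compose file that has already been parsed into a Python dict. The service name `contextionary`, the mapping-style `environment` section, and the helper `enable_compound_splitting` are assumptions made for illustration; check them against your own compose file.

```python
import copy
from typing import Any, Dict


def enable_compound_splitting(
    compose: Dict[str, Any], service: str = "contextionary"
) -> Dict[str, Any]:
    """Return a copy of a parsed compose file with the flag set to 'true'."""
    patched = copy.deepcopy(compose)
    # Assumes `environment` is written as a mapping, not as a list of KEY=value strings.
    env = patched["services"][service].setdefault("environment", {})
    env["ENABLE_COMPOUND_SPLITTING"] = "true"
    return patched


# Hypothetical, heavily trimmed compose structure; real files contain more settings.
compose = {
    "services": {
        "contextionary": {
            "environment": {"ENABLE_COMPOUND_SPLITTING": "false"},
        },
    },
}

patched = enable_compound_splitting(compose)
print(patched["services"]["contextionary"]["environment"])
```

The value is kept as the string "true" because environment variables are passed to the container as strings.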

    Classification Performance Improvements
    Prior to this release, classifications (both kNN and contextual) were single-threaded and thus did not utilize all available resources. With this release, both classification types use as many threads as there are CPU cores available. This can speed up classifications considerably on larger machines.
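The pattern described here, one worker per available CPU core, can be sketched as follows. This is a toy kNN classifier written for illustration, not Weaviate's implementation; `knn_label` and `classify_all` are hypothetical names. Note that in CPython the global interpreter lock limits the speed-up for pure-Python CPU-bound work, so the sketch shows the structure of the approach rather than its performance.

```python
import os
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn_label(item: Vector, training: List[Tuple[Vector, str]], k: int = 3) -> str:
    """Return the majority label among the k nearest training vectors."""
    nearest = sorted(training, key=lambda pair: squared_distance(item, pair[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def classify_all(
    items: List[Vector], training: List[Tuple[Vector, str]], k: int = 3
) -> List[str]:
    """Classify all items using as many worker threads as there are CPU cores."""
    workers = os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the input items.
        return list(pool.map(lambda item: knn_label(item, training, k), items))


training = [
    ((0.0, 0.0), "a"), ((0.0, 1.0), "a"), ((1.0, 0.0), "a"),
    ((10.0, 10.0), "b"), ((10.0, 11.0), "b"), ((11.0, 10.0), "b"),
]
print(classify_all([(0.5, 0.5), (10.5, 10.5)], training))  # ['a', 'b']
```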

    🛠 Fixes

    none