Weaviate v0.22.7 Release Notes

Release Date: 2020-04-29
  • Docker image/tag: semitechnologies/weaviate:0.22.7
    See also: example docker compose files in English, German, Dutch, Italian and Czech.

    Breaking Changes

    New Features

    Improved contextual classification algorithm (#1125)
    Prior to this release, a contextual classification would often yield a false positive for whichever label was closest to the "noise center". In other words, filler and stop words were overweighted and the most important words did not receive enough attention.

    Because a contextual classification compares a data object to its label, rather than to other data objects as a kNN-type classification does, this issue was far more prevalent in contextual classifications than in kNN classifications. In the latter, the noise is present among all data objects, so it was likely to cancel out. Contextual classification suffered in particular on data objects with long texts.

    This release introduces a complete rewrite of the classification algorithm. Instead of weighing each word purely on its occurrence in the Contextionary, we now weigh (and even remove) words based on two new metrics: Information Gain and tf-idf.

    Information Gain is a custom measure that predicts how likely a given word is to influence the classification towards a specific target (label). For example, imagine the data object "I love my new computer" with the possible labels "Technology", "Food" and "Politics". Looking at each word in the source object, Weaviate would identify "computer" as the word with the highest information gain, as it clearly moves the vector towards one of the categories ("Technology"). The other words might point to any of the categories without a clear favorite, so their information gain should be lower. As a result, Weaviate weighs "computer" the highest in the data object.
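
    The intuition can be illustrated with a small Python sketch. It is a toy stand-in only: Weaviate's actual Information Gain measure is custom and its formula is not given in these notes. The 3-dimensional vectors and the names toy_information_gain and nearest_label are invented for illustration. Here a word scores high when it is much closer to one label than to the labels on average.

```python
import math

# Hypothetical 3-D vectors, invented for illustration only; real
# Contextionary vectors have many more dimensions.
LABEL_VECTORS = {
    "Technology": (1.0, 0.0, 0.0),
    "Food": (0.0, 1.0, 0.0),
    "Politics": (0.0, 0.0, 1.0),
}
WORD_VECTORS = {
    "i": (0.5, 0.5, 0.5),
    "love": (0.4, 0.6, 0.4),
    "my": (0.5, 0.5, 0.5),
    "new": (0.5, 0.5, 0.5),
    "computer": (0.9, 0.1, 0.1),
}


def toy_information_gain(word_vector, label_vectors):
    """Mean distance to all labels minus the distance to the nearest label.

    Assumption: this is NOT Weaviate's formula, only a stand-in that is
    near zero when a word is equally far from every label.
    """
    distances = [math.dist(word_vector, vec) for vec in label_vectors.values()]
    return sum(distances) / len(distances) - min(distances)


def nearest_label(word_vector, label_vectors):
    """Return the name of the label whose vector is closest to the word."""
    return min(label_vectors, key=lambda name: math.dist(word_vector, label_vectors[name]))


scores = {
    word: toy_information_gain(vec, LABEL_VECTORS)
    for word, vec in WORD_VECTORS.items()
}
best_word = max(scores, key=scores.get)
print(best_word, nearest_label(WORD_VECTORS[best_word], LABEL_VECTORS))
# → computer Technology
```

    In this toy setup "new" sits at the same distance from all three labels and therefore scores (numerically) zero, while "computer" clearly favors one label and scores highest.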

    Tf-idf, on the other hand, does not compare the data objects directly to a target (label), but rather to other objects. Given multiple objects such as "My new computer is great!", "Who is the new president?" and "New dishes on the menu!", the word "new" occurs in every object and thus has an Inverse Document Frequency of 0. Based on user configuration, this word can be removed from vectorization entirely.
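
    The Inverse Document Frequency of 0 for "new" can be reproduced with a short Python sketch. It assumes the plain, unsmoothed textbook definition idf = log(N / df) and a simple lowercase letters-only tokenizer; the exact variant Weaviate uses, and how the cutoff percentile is applied to it, are not specified in these notes. The function name is hypothetical.

```python
import math
import re


def inverse_document_frequencies(documents):
    """Return {word: log(N / df)}, where df counts documents containing the word.

    Assumption: unsmoothed textbook idf, not necessarily Weaviate's variant.
    """
    # Lowercase and keep only alphabetic tokens, so "New" and "new" match.
    tokenized = [set(re.findall(r"[a-z]+", doc.lower())) for doc in documents]
    n_docs = len(tokenized)
    vocabulary = set().union(*tokenized)
    return {
        word: math.log(n_docs / sum(word in tokens for tokens in tokenized))
        for word in vocabulary
    }


documents = [
    "My new computer is great!",
    "Who is the new president?",
    "New dishes on the menu!",
]
idf = inverse_document_frequencies(documents)

# Words that occur in every document carry no distinguishing information.
droppable = sorted(word for word, score in idf.items() if score == 0.0)
print(droppable)  # → ['new']
```

    Words such as "is" and "the" appear in only two of the three objects, so their idf stays above 0 and they are not dropped by this strict criterion.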

    The new mechanisms are user-configurable. They come with reasonable defaults that will work for many datasets, but to get the most out of your classification, it might make sense to tweak them until you get the best possible results. For a detailed list and explanation of the newly introduced parameters, see this comment.

    Benchmark

    In a benchmark based on the 20 Newsgroups data set, we have seen a substantial improvement in success rates (results in the table below).

    Note that this benchmark was done using a contextual classification, i.e. without training data (labeled data). The success rates are therefore not comparable to those of other mechanisms which rely on training data. If you want to compare Weaviate's performance with other classification mechanisms which require labeled data, please run a kNN classification instead.

    Main Category

    The posts were to be categorized as one of 6 categories (expected success rate for random distribution ~16.7%)

    Granular Category

    The posts were to be categorized as one of 20 categories (expected success rate for random distribution ~5%)

    Goal                 Previous (< 0.22.7)    Improved Algorithm (>= 0.22.7)
    Main Category        18%                    58%
    Granular Category    10%                    42%

    The following settings were used:

    # dataset
    n: 563 # randomly picked with a roughly equal size per category

    # configuration
    type: contextual
    informationGainCutoffPercentile: 10
    informationGainMaximumBoost: 3
    tfidfCutoffPercentile: 80
    

    Fixes

    • Fix unexpected behavior on geoCoordinates 0,0 (#825)
      GeoCoordinates of 0,0, infamously known as Null Island, would lead to the geoCoordinates property disappearing entirely, because 0 also happens to be the zero (initial) value for a property of type float in Go. This release fixes this, and a 0 coordinate is now explicitly displayed as such.