How To Index Array of Objects in Elasticsearch

Zuzia Kusznir

Recently, I’ve been playing around with search in Elasticsearch and got stuck when attempting to work with an array of objects. Indexing went fine; the query results, however, did not look as expected.

Elasticsearch is a really powerful search and analytics engine which comes in very handy when you need to perform a text-based search on data collections. But you will only get along with this particular tool as long as you understand some of its specific behaviors. At least that’s what I learned when I discovered the power of Elasticsearch’s nested objects.

The examples below are written in Ruby with assistance from the elasticsearch gem.

First There’s an Idea

Let’s suppose you’re running a recruitment agency helping software houses hire developers perfectly matching the requirements for their open positions. In order to simplify the example, the personal details of developers will be limited to their names and skills, including the languages they know along with the level of their proficiency therein. So far, only two developers have registered with your agency. The documents representing developer data can be found below:


[
  {
    name: "John Doe",
    skills: [
      {
        language: "ruby",
        level: "expert",
      },
      {
        language: "javascript",
        level: "beginner",
      },
    ]
  },
  {
    name: "Mark Smith",
    skills: [
      {
        language: "ruby",
        level: "beginner",
      },
    ]
  },
]

 

And Then There’s the Implementation

To index documents with this structure, an index needs to be created first:


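require "elasticsearch"

client = Elasticsearch::Client.new(log: true)

# Note: named mapping types ("developer") belong to Elasticsearch 6.x and earlier.
client.indices.create({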
  index: "developers",
  type: "developer",
  body: {
    mappings: {
      developer: {
        properties: {
          name: { type: "text" },
          skills: {
            properties: {
              language: { type: "keyword" },
              level: { type: "keyword" },
            },
          },
        },
      },
    },
  },
})
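The indexing step itself is not shown in this article. Here is a minimal sketch of it, assuming the `client` created above and the 6.x-era `developer` mapping type used throughout. The helper name `index_requests` and the sequential ids are illustrative choices, not something the elasticsearch gem prescribes.

```ruby
# Sample documents from the beginning of the article.
developers = [
  {
    name: "John Doe",
    skills: [
      { language: "ruby", level: "expert" },
      { language: "javascript", level: "beginner" },
    ],
  },
  {
    name: "Mark Smith",
    skills: [{ language: "ruby", level: "beginner" }],
  },
]

# Builds one argument hash per document for client.index.
# Hypothetical helper: ids are simply 1, 2, ... in array order.
def index_requests(developers, index: "developers", type: "developer")
  developers.each_with_index.map do |developer, position|
    { index: index, type: type, id: position + 1, body: developer }
  end
end

requests = index_requests(developers)

# With a running cluster and the client created earlier:
#   requests.each { |request| client.index(request) }
#   client.indices.refresh(index: "developers") # make the documents searchable
```

Keeping the request building separate from the client call makes it easy to inspect what would be sent before touching a cluster.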

 Then, a software house approaches your agency and asks for a list of Ruby developers who are just starting their adventure with the language. The search query you run looks as follows:



client.search({
  index: "developers",
  body: {
    query: {
      bool: {
        filter: [
          { match: { "skills.language": "ruby" }},
          { match: { "skills.level": "beginner" }},
        ]
      },
    },
  },
})

 And the results are:


{
  ...,
  "hits": [{
    ...,
    "_source": {
      "name": "John Doe",
      "skills": [
        { "language": "ruby", "level": "expert" },
        { "language": "javascript", "level": "beginner" }
      ]
    }
  },
  {
    ...,
    "_source": {
      "name": "Mark Smith",
      "skills": [{ "language": "ruby", "level": "beginner" }]
    }
  }]
}

These results, however, are not what you expect—the query returns Mark Smith, a beginner in Ruby, as well as John Doe who is an expert in the language. And this gets you thinking…

...What Went Wrong?

At first, you may blame the query itself, but that is not the case. The answer to the question in the heading above lies in the way Elasticsearch indexes arrays of objects by default. Internally, John Doe’s document looks like this:


{
  "name": "John Doe",
  "skills.language": ["ruby", "javascript"],
  "skills.level": ["expert", "beginner"],
}

The array of objects has been flattened into one array of values per field. So, despite John Doe not being a beginner in Ruby, he was listed in the query result because both filter values (“skills.language”: “ruby” and “skills.level”: “beginner”) are present in his document, even though they come from different skill objects.

Although such behavior speeds up the query, it is not suitable for cases where we would like to preserve the relation between an object’s fields. To preserve it, we need to take advantage of nested objects, which are indexed as separate documents within the parent document:



{
  "name": "John Doe",
  {
    "skills.language": "ruby",
    "skills.level": "expert",
  },
  {
    "skills.language": "javascript",
    "skills.level": "beginner",
  },
}

As you can see, the relation between every nested object’s fields is preserved and you will finally be able to get your expected query results. But, before you can make that happen, you need to mind a couple of implementation details outlined below.
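The difference between the two representations can be modelled in plain Ruby, without Elasticsearch at all. This is only a rough sketch of the matching semantics: `flattened_match?` and `nested_match?` are hypothetical helpers and do not mirror how Elasticsearch works internally.

```ruby
john = {
  "name" => "John Doe",
  "skills" => [
    { "language" => "ruby", "level" => "expert" },
    { "language" => "javascript", "level" => "beginner" },
  ],
}

mark = {
  "name" => "Mark Smith",
  "skills" => [{ "language" => "ruby", "level" => "beginner" }],
}

# Default object behaviour: each condition is checked against the pooled
# values of all skills, so different skills may satisfy different conditions.
def flattened_match?(developer, conditions)
  conditions.all? do |field, value|
    developer["skills"].map { |skill| skill[field] }.include?(value)
  end
end

# Nested behaviour: all conditions must hold within one and the same skill.
def nested_match?(developer, conditions)
  developer["skills"].any? do |skill|
    conditions.all? { |field, value| skill[field] == value }
  end
end

beginner_in_ruby = { "language" => "ruby", "level" => "beginner" }

puts flattened_match?(john, beginner_in_ruby) # true: both values occur somewhere
puts nested_match?(john, beginner_in_ruby)    # false: no single skill has both
puts nested_match?(mark, beginner_in_ruby)    # true
```

The only change between the two helpers is where the `all?` sits: across the whole document, or inside each individual skill.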

Gimme the Solution

First of all, you need to modify the index’s mapping a little bit:


client.indices.create({
  index: "developers",
  type: "developer",
  body: {
    mappings: {
      developer: {
        properties: {
          name: { type: "text" },
          skills: {
            type: "nested",
            properties: {
              language: { type: "keyword" },
              level: { type: "keyword" }
            }
          }
        }
      }
    }
  }
})

With type: "nested" (line 10), we define every skill object to be nested within the developer document, which means Elasticsearch will index every object separately. However, not only does the index need to be modified, but the search query as well (lines 5-6):


client.search({
  index: "developers",
  body: {
    query: {
      nested: {
        path: "skills",
        query: {
          bool: {
            filter: [
              { match: { "skills.language": "ruby" }},
              { match: { "skills.level": "beginner" }}
            ]
          }
        }
      }
    }
  }
})

 The result:


{
  ...
  "hits" => [{
    ...,
    "_source" => {
      "name" => "Mark Smith",
      "skills" => [{ "language" => "ruby", "level" => "beginner" }]
    }
  }
]}

 And there you are—the query above returned exactly what we expected—Mark Smith, a beginner in Ruby. Try it for yourself!
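One practical detail if you try this after running the earlier snippets: Elasticsearch does not allow changing an already mapped object field to nested, so the existing developers index has to be dropped and recreated, and the documents indexed again. Here is a rough sketch of that sequence, assuming the 6.x-style mapping used in this article. `rebuild_index` and its parameters are hypothetical names, while `indices.exists?`, `indices.delete`, `indices.create`, `indices.refresh` and `index` are regular client calls.

```ruby
# Drops the old index (if any), recreates it with the given mapping body
# and indexes the documents again. `client` is an Elasticsearch::Client.
def rebuild_index(client, documents, body:, index: "developers", type: "developer")
  client.indices.delete(index: index) if client.indices.exists?(index: index)
  client.indices.create(index: index, body: body)

  documents.each_with_index do |document, position|
    client.index(index: index, type: type, id: position + 1, body: document)
  end

  # Refresh so that the documents are visible to the next search.
  client.indices.refresh(index: index)
end

# Same nested mapping as in the snippet above.
nested_mapping = {
  mappings: {
    developer: {
      properties: {
        name: { type: "text" },
        skills: {
          type: "nested",
          properties: {
            language: { type: "keyword" },
            level: { type: "keyword" },
          },
        },
      },
    },
  },
}

# Usage with a running cluster, where `developers` holds the sample documents:
#   rebuild_index(client, developers, body: nested_mapping)
```

Dropping an index deletes its data, so this only makes sense while experimenting or when the documents can be restored from another source.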

What If I Needed an Aggregation on Nested Objects?

Sure, why not? In a manner similar to the one we used for the search query, we need to insert a nested statement into the aggregation. If we wanted to find out how many developers code in each language, we could define the following aggregation:



client.search({
  index: "developers",
  body: {
    aggregations: {
      skills: {
        nested: { path: "skills" },
        aggregations: {
          language: {
            terms: {
              field: "skills.language"
            }
          }
        }
      }
    }
  }
})

The result:


{
  ...
  "aggregations" => {
    "skills" => {
      "doc_count" => 3,
      "language" => {
        ...,
        "buckets" => [
          { "key" => "ruby", "doc_count" => 2 },
          { "key" => "javascript", "doc_count" => 1 }
        ]
      }
    }
  }
}
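To use these numbers in application code, the buckets can be folded into a simple hash. This is a small sketch that works on a response shaped like the abbreviated one above. `language_counts` is a hypothetical helper, and the elided parts of the response are simply left out.

```ruby
# Response trimmed to the parts shown above.
response = {
  "aggregations" => {
    "skills" => {
      "doc_count" => 3,
      "language" => {
        "buckets" => [
          { "key" => "ruby", "doc_count" => 2 },
          { "key" => "javascript", "doc_count" => 1 },
        ],
      },
    },
  },
}

# Maps each language to the number of nested skill objects that mention it.
# Returns an empty hash when the aggregation is missing from the response.
def language_counts(response)
  buckets = response.dig("aggregations", "skills", "language", "buckets") || []
  buckets.to_h { |bucket| [bucket["key"], bucket["doc_count"]] }
end

counts = language_counts(response)
counts.each { |language, count| puts "#{language}: #{count}" }
```

Note that the counts refer to nested skill objects (three in total, as the outer doc_count shows). They equal the number of developers here only because each developer lists a given language once.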

 

When Should I Use Nested Objects?

Elasticsearch nested objects are a perfect match for data structures containing collections of inner objects that are tightly coupled with the outer object and/or describe it. The developer data structure above, with its inner skills objects, is a good case for nested objects: what employers are likely to be most interested in are a developer’s skills (languages, experience, proficiency levels, etc.), with other characteristics playing a lesser role.

Addendum

If you happen to be using the searchkick gem and need to use nested objects in your application, you may want to reconsider your choice, since Searchkick does not support nested objects yet and it requires a bit of hacking to make it work. In such a case, I’d recommend going with elasticsearch-dsl instead.
