All times are UTC - 5 hours [ DST ]



 Post subject: Mass index with Elasticsearch
PostPosted: Sun Jul 02, 2017 4:36 pm 
Newbie

Joined: Fri Jun 16, 2017 5:26 pm
Posts: 7
Trying Hibernate Search 5.8 beta 3 and Elasticsearch 5.2.2.

Attempting to mass index around a million records and finding that it either takes a very long time (as much as an hour) or runs out of memory (-Xmx at 2048 MB).

Started with the default approach (fullTextSession.createIndexer().startAndWait()), then played with different settings for batch size and thread count, but am still observing the same results. Using a progress monitor I observe long gaps (minutes) where no callbacks occur and CPU usage is low - perhaps due to throttling or I/O?

Looking at the retained heap leads me to believe that work is being queued up to be pushed into the index (as opposed to waiting on pulling data out of the database). Not sure whether the out of memory error was due to a direct memory leak or to full GC events not being able to keep up.

Class Name                                                                                                             | Shallow Heap | Retained Heap | Percentage
------------------------------------------------------------------------------------------------------------------------------------------------------------------
org.hibernate.search.elasticsearch.processor.impl.ElasticsearchWorkProcessor$AsyncBackendRequestProcessor @ 0x94911e00 |           32 |   882,363,120 |     42.57%
|- org.hibernate.search.backend.impl.lucene.MultiWriteDrainableLinkedList @ 0x913a16f0                                 |           24 |   882,363,072 |     42.57%
------------------------------------------------------------------------------------------------------------------------------------------------------------------
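
As a quick sanity check on the figures above, the retained size and its percentage together imply how much heap the dump held in total. A minimal sketch of that arithmetic (the class and method names are made up for illustration):

```java
// Back-of-the-envelope check on the heap dump figures quoted above.
// RetainedHeapEstimate and its methods are illustrative names only.
class RetainedHeapEstimate {

    /** Total heap implied by one object's retained size and its share of the heap. */
    static long impliedTotalBytes(long retainedBytes, double percentOfHeap) {
        return Math.round(retainedBytes / (percentOfHeap / 100.0));
    }

    /** Converts a byte count to mebibytes. */
    static double toMiB(long bytes) {
        return bytes / (1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        // Values copied from the AsyncBackendRequestProcessor row of the table.
        long total = impliedTotalBytes(882_363_120L, 42.57);
        System.out.printf("heap held in dump: about %.0f MiB%n", toMiB(total));
    }
}
```

With the values from the table this comes out a little under 2 GiB, so that single work queue was retaining close to half of everything on the heap.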

We never explicitly set hibernate.search.default.worker.execution, but I suppose that isn't relevant here, as I presume a mass index operation queues up work asynchronously. Is there a configurable option to throttle the producer to limit the size of the queue? Also, I read that there is room for improvement in the performance of mass indexing with Elasticsearch. Is there active work happening in that area? Are there any recommended configuration settings on the Elasticsearch or Hibernate Search side to optimize mass indexing?
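
To make the throttling question concrete: the usual way to keep a producer from outrunning a slow consumer is a bounded queue whose put() blocks when the queue is full. The sketch below uses only java.util.concurrent and illustrates the general idea; it is not how Hibernate Search is implemented, and BoundedWorkQueueDemo, run, total and capacity are hypothetical names.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Generic illustration of producer throttling (backpressure).
// None of these names come from Hibernate Search.
class BoundedWorkQueueDemo {

    /**
     * Pushes `total` work items through a queue holding at most `capacity`
     * entries and returns the largest queue size the producer observed.
     */
    static int run(int total, int capacity) {
        BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(capacity);

        // Consumer: stands in for the thread that sends work to the index.
        Thread consumer = new Thread(() -> {
            try {
                for (int taken = 0; taken < total; taken++) {
                    queue.take();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        consumer.start();

        int maxObserved = 0;
        try {
            // Producer: stands in for the threads loading entities from the database.
            for (int i = 0; i < total; i++) {
                queue.put(i); // blocks while the queue is full
                maxObserved = Math.max(maxObserved, queue.size());
            }
            consumer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while queueing work", e);
        }
        return maxObserved;
    }

    public static void main(String[] args) {
        System.out.println("max queue depth: " + run(10_000, 64));
    }
}
```

Because put() waits for free space, memory use is capped by the queue capacity no matter how far ahead the producer could otherwise run.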

Our production use case would involve hundreds of millions (or perhaps over a billion) records. For this particular use case would it be preferable to look to a less "bleeding edge" approach and resort to JGroups/Infinispan as opposed to Elasticsearch?


 Post subject: Re: Mass index with Elasticsearch
PostPosted: Tue Jul 04, 2017 4:00 pm 
Hibernate Team

Joined: Fri Oct 05, 2007 4:47 pm
Posts: 2536
Location: Third rock from the Sun
Hi,
thanks for testing the beta!

Yes, your expectations around async work are right. Unfortunately the work on internal performance - especially on how we use the async capabilities of the driver - is still in progress.

Two issues you might want to monitor / get involved with:
- https://hibernate.atlassian.net/browse/HSEARCH-2764
- https://hibernate.atlassian.net/browse/HSEARCH-2455

We're definitely going to work on the first one; the second one is a "maybe / to be explored". I expect some prototypes and active testing to show up on GitHub this week and/or next week; feel free to peek: https://github.com/hibernate/hibernate-search/pulls

The MultiWriteDrainableLinkedList you see is coming from the sequential (& blocking) backend, which needs to be replaced - especially when using the MassIndexer.

Quote:
Our production use case would involve hundreds of millions (or perhaps over a billion) records. For this particular use case would it be preferable to look to a less "bleeding edge" approach and resort to JGroups/Infinispan as opposed to Elasticsearch?


The JGroups/Infinispan approach has been tested more, so you might be able to set it up already without having to wait for (or help with) the above-mentioned issues; however the JGroups backend is very limited (and experimental), so I'd rather suggest the JMS/Infinispan combination. The JMS backend alternatives are complex to set up though, as you'll have to configure master election, and the details will depend on the JMS vendor and its capabilities.

My hope is that the Elasticsearch integration will make this much easier to set up, so if you can wait and/or are willing to help test the next beta, that should be a better investment.

_________________
Sanne
http://in.relation.to/


 Post subject: Re: Mass index with Elasticsearch
PostPosted: Thu Jul 06, 2017 4:53 pm 
Newbie

Joined: Fri Jun 16, 2017 5:26 pm
Posts: 7
Hi Sanne,

Thank you very much for your reply. We're continuing to test the Elasticsearch integration in hopes that it will be sufficient, given the complexity that an Infinispan/JMS integration would add to our multi-node architecture.

Here is the proposed strategy to mitigate the long indexing time when migrating to the new version that leverages Hibernate Search:
- Take one of the application nodes out of the load balancer configuration
- Use that node to initiate a mass reindex
- Other nodes continue to run, but federated search queries will use standard Hibernate Criteria queries as opposed to full text queries that use the index.
- When indexing completes, put the indexing node back into the load balancer configuration and allow full text queries to resume
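
The query routing part of the steps above could be sketched as a simple switch between two query paths. SearchRouter, markIndexReady and both query functions are hypothetical placeholders for application code, and the flag here lives in a single JVM; in a multi-node deployment it would have to come from shared configuration.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Function;

// Hypothetical router: sends searches to a database fallback until the
// index is declared ready, then to the full text path.
class SearchRouter<R> {
    private final AtomicBoolean indexReady = new AtomicBoolean(false);
    private final Function<String, List<R>> fullTextQuery;
    private final Function<String, List<R>> criteriaFallback;

    SearchRouter(Function<String, List<R>> fullTextQuery,
                 Function<String, List<R>> criteriaFallback) {
        this.fullTextQuery = fullTextQuery;
        this.criteriaFallback = criteriaFallback;
    }

    /** Call once the mass index run has completed. */
    void markIndexReady() {
        indexReady.set(true);
    }

    /** Runs the search against whichever path is currently active. */
    List<R> search(String terms) {
        return indexReady.get()
                ? fullTextQuery.apply(terms)
                : criteriaFallback.apply(terms);
    }
}
```

Keeping both paths behind one method means callers do not need to know whether a reindex is in progress.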

My main concern with this approach is whether or not there is potential to lose data if a document is modified and then subsequently indexed. Please forgive my ignorant question: Does Hibernate Search implicitly handle this case using a version field? If not, we might need to implement our own mass indexer that is aware of the version number on the records it's indexing.
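
A version-aware write of the kind described above could look roughly like this. It is a sketch against an in-memory map, assuming each record carries a version number that only ever increases; VersionedIndex and its methods are hypothetical and not part of Hibernate Search or Elasticsearch.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical in-memory stand-in for an index that refuses stale writes.
class VersionedIndex {

    private static final class Doc {
        final long version;
        final String body;

        Doc(long version, String body) {
            this.version = version;
            this.body = body;
        }
    }

    private final Map<String, Doc> docs = new ConcurrentHashMap<>();

    /**
     * Stores the document unless a strictly newer version is already present.
     * Returns true if the incoming document was stored.
     */
    boolean index(String id, long version, String body) {
        Doc candidate = new Doc(version, body);
        // merge() is atomic per key, so concurrent writers cannot interleave here.
        Doc winner = docs.merge(id, candidate,
                (existing, incoming) ->
                        incoming.version >= existing.version ? incoming : existing);
        return winner == candidate;
    }

    /** Returns the stored body, or null if the id is unknown. */
    String get(String id) {
        Doc doc = docs.get(id);
        return doc == null ? null : doc.body;
    }
}
```

With this rule a mass indexer that loaded a record before it was modified cannot overwrite the newer copy written by a later transaction.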

My other concern now is the latency that indexing will introduce to our Hibernate transactions (particularly with hibernate.search.default.elasticsearch.refresh_after_write set to true). Obviously we'll need to run some performance tests to assess whether this is a valid concern. The classes we are indexing are less write-heavy than some of our other classes, so this latency may be an acceptable trade-off.


 Post subject: Re: Mass index with Elasticsearch
PostPosted: Tue Jul 11, 2017 7:42 am 
Hibernate Team

Joined: Fri Oct 05, 2007 4:47 pm
Posts: 2536
Location: Third rock from the Sun
Hi,
Quote:
Please forgive my ignorant question: Does Hibernate Search implicitly handle this case using a version field? If not, we might need to implement our own mass indexer that is aware of the version number on the records it's indexing.


That's not a dumb question at all ;)

Hibernate Search doesn't use versioning, but while a MassIndexer is running, any new changes triggered by a running transaction are enqueued in the same strictly ordered queue of work generated by the MassIndexer.

This implies you definitely can run "normal" transactions in parallel with the MassIndexer, but only as long as they run on the same indexer node. So if you use the JMS approach, you can disconnect the master from receiving web traffic to help it, but still let it be the master node for the other nodes as well. (If you enable two separate "masters" then you get in trouble.)

Are you sure you want to re-implement all full-text queries using Hibernate Criteria to fill the gap? An alternative would be to temporarily stop re-synching the index, or avoid refreshing the IndexReaders, so that the client nodes can use the outdated index until the replacement is ready. Of course this only works if you are OK with using an index which is potentially out of date by some hours, but at least you won't be hammering your RDBMS with complex queries while it's also serving the MassIndexer.

Quote:
My other concern now is the latency that indexing will introduce to our hibernate transactions (particularly with hibernate.search.default.elasticsearch.refresh_after_write set to true).


I would not suggest that. Elasticsearch is really not designed to use refresh_after_write all the time; this would severely impact its performance. To be entirely fair, we primarily implemented support for this operation mode to make it easier to run integration tests, but it's highly recommended to treat the search engine as a service which is possibly slightly out of date.

Of course feel free to experiment - you might have specific requirements and you might have no choice - but I agree this would be a point of concern which requires significant testing with complete data sets.

_________________
Sanne
http://in.relation.to/


© Copyright 2014, Red Hat Inc. All rights reserved. JBoss and Hibernate are registered trademarks and servicemarks of Red Hat, Inc.