The same full-text search engine for different products. Apache Luke.

 Hi, awesome community!

 

This article describes a tool that helps me investigate search indexes.

At the moment, I am investigating Russian-language stemming and how to add morphological analysis by reusing existing libraries.

So today we will talk about the Apache Lucene index and a small, awesome utility: Apache Luke.

Lucene matters because it is the search engine library behind full-text search in Apache Solr and Elasticsearch as well.

This means the on-premises versions of Jira, Confluence, and Bamboo use Lucene, while Bitbucket uses Elasticsearch. For Cloud, I imagine the Atlassian team uses Elasticsearch, as it scales more easily than a local Apache Lucene index. For example, with Lucene you need to set up replication yourself (lucene-replicator - https://lucene.apache.org/core/7_4_0/replicator/org/apache/lucene/replicator/Replicator.html), or you can just use Elasticsearch.

 

Let’s use Apache Luke for the Confluence search indexes:

image.png

For example, you may see log entries like these:

[2020-03-08T18:04:08,397]  WARN (IndexUtils.java:86) - Format version is not supported (resource BufferedChecksumIndexInput(SimpleFSIndexInput(path="/Users/gonchik.tsymzhitov/temp/lucene/META-INF/112/edge/segments_1"))): 0 (needs to be between 7 and 9). This version of Lucene only supports indexes created with release 6.0 and later.

org.apache.lucene.index.IndexFormatTooOldException: Format version is not supported (resource BufferedChecksumIndexInput(SimpleFSIndexInput(path="/Users/gonchik.tsymzhitov/temp/lucene/META-INF/112/edge/segments_1"))): 0 (needs to be between 7 and 9). This version of Lucene only supports indexes created with release 6.0 and later.
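
As a side note, the version numbers in that message can be pulled out with a few lines of Python, which is handy when checking many index directories. This is only a sketch: the regular expression is based on the wording of the log above (other Lucene versions may phrase it differently), the helper names parse_format_error and is_too_old are made up for this example, and the sample path is a placeholder.

```python
import re

# Pattern based on the wording of the log message above.
# Assumption: other Lucene versions may word this message differently.
PATTERN = re.compile(
    r"Format version is not supported .*: (-?\d+) "
    r"\(needs to be between (-?\d+) and (-?\d+)\)"
)


def parse_format_error(line):
    """Return (found, minimum, maximum), or None if the line does not match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    found, minimum, maximum = (int(group) for group in match.groups())
    return found, minimum, maximum


def is_too_old(line):
    """True when the reported format version is below the supported minimum."""
    parsed = parse_format_error(line)
    return parsed is not None and parsed[0] < parsed[1]


# Sample line with a placeholder path instead of a real index location.
sample = (
    "org.apache.lucene.index.IndexFormatTooOldException: "
    "Format version is not supported (resource "
    'BufferedChecksumIndexInput(SimpleFSIndexInput(path="/path/to/index/segments_1"))): '
    "0 (needs to be between 7 and 9)."
)
print(parse_format_error(sample))  # (0, 7, 9)
print(is_too_old(sample))          # True
```

In this case the found version 0 is below the supported minimum 7, so the index is older than this Luke build can read.
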
  • Therefore I recommend using luke-4.10.1 and running luke.sh or luke.bat. Don’t forget to tick the options “Don’t open IndexReader (when opening corrupted index)” and “Force unlock, if locked”.

image.png

  • After that, you will see the stats of your index, e.g.:

image.png

  • Next, you can see the contentBody field, which is used for the Confluence content, and in the next table you will see the top terms.

image.png

  • Then, on the Documents tab, if you double-click the top term, you will see which documents contain the top term “сво”.

image.png

image.png
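
To make the top-terms table and the document lookup less abstract, here is a minimal Python sketch of the idea behind them: an inverted index that maps each term to the documents containing it. It is not Lucene code. The tokenizer is a plain whitespace split with no stemming, the sample documents are invented, and ranking by document frequency is an assumption about how to read the table in Luke.

```python
from collections import defaultdict


def tokenize(text):
    """Very rough tokenizer: lowercase and split on whitespace (no stemming)."""
    return text.lower().split()


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            postings[term].add(doc_id)
    return postings


def top_terms(postings, n):
    """Terms ranked by document frequency, ties broken alphabetically."""
    ranked = sorted(postings.items(), key=lambda item: (-len(item[1]), item[0]))
    return [(term, len(doc_ids)) for term, doc_ids in ranked[:n]]


# Invented page bodies standing in for the contentBody field.
docs = {
    1: "release notes for the search index",
    2: "search syntax reference",
    3: "index rebuild guide",
}
postings = build_index(docs)
print(top_terms(postings, 2))      # [('index', 2), ('search', 2)]
print(sorted(postings["search"]))  # [1, 2]
```

Double-clicking a top term in Luke corresponds roughly to the last line: looking up the list of documents stored for that term.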

If you click “Explain structure”, you can see the reasoning behind the rules described in the Confluence search syntax documentation: https://confluence.atlassian.com/doc/confluence-search-syntax-158720.html

image.png image.png


That’s all for today. 

 

Conclusion

  1. I am happy to see the Apache Luke tool included in the Apache Lucene binary builds as a built-in tool.
  2. The tool helps me understand how search and stats work in Atlassian products and in many other Lucene-based full-text search projects.
  3. I hope the Atlassian team will one day upgrade the Lucene libraries in Confluence (https://jira.atlassian.com/browse/CONFSERVER-57452). Feel free to vote if you want to see improvements in Confluence tokenising, stemming, search, and ranking functionality.
  4. One day, we may be able to use morphological analysis as an optional feature of these products, e.g. Russian morphology (https://github.com/AKuznetsov/russianmorphology).



Cheers,

Gonchik Tsymzhitov
