The same full-text search engine for different products. Apache Luke.

- May 08, 2020

Hi, awesome community!

This article introduces a tool that helps me investigate search indexes.

At the moment, I am investigating Russian-language stemming and looking at how to add morphological analysis by reusing existing libraries.

So, as you might guess, today we will talk about the Apache Lucene index and a small, awesome utility, Apache Luke.

Why Lucene? Because this search engine library powers full-text search not only on its own but in Solr and Elasticsearch as well.

This means that the on-premises versions of Jira, Confluence, and Bamboo use Lucene, while Bitbucket uses Elasticsearch. As for Cloud, I imagine the Atlassian team uses Elasticsearch, as it scales more easily than a local Apache Lucene index. For example, to replicate a Lucene index you need something like lucene-replicator ( https://lucene.apache.org/core/7_4_0/replicator/org/apache/lucene/replicator/Replicator.html ), or you can just use Elasticsearch.

Let’s use Apache Luke on the Confluence search indexes:

I copied the indexes from ${CONFLUENCE_HOME}/index/ to a temporary directory.
Then I downloaded the old release https://github.com/DmitryKey/luke/releases/tag/luke-4.10.4.1 to investigate the Confluence index. For the Jira indexes, you can download Luke from the Apache Lucene website ( https://lucene.apache.org/core/downloads.html ), as Jira uses a newer Lucene version.
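
The copy step can be scripted so that Luke never touches the live index. This is a minimal sketch, not an official Atlassian procedure: the function name `copy_index` and the destination layout are my own assumptions, and only the `index` subdirectory of the Confluence home comes from the step above.

```python
import shutil
from pathlib import Path


def copy_index(confluence_home: Path, dest_root: Path) -> Path:
    """Copy <confluence_home>/index into dest_root and return the path of the copy.

    Hypothetical helper: the name and the target layout are illustrative only.
    """
    source = confluence_home / "index"
    if not source.is_dir():
        raise FileNotFoundError(f"No index directory at {source}")
    target = dest_root / "index"
    # copytree refuses to overwrite an existing target, so an earlier copy is not clobbered.
    shutil.copytree(source, target)
    return target
```

For example, `copy_index(Path(os.environ["CONFLUENCE_HOME"]), Path(tempfile.mkdtemp()))` would give you a throwaway copy to open in Luke.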

If you open the Confluence index with a newer Luke release, you will see logs like these:

[2020-03-08T18:04:08,397]  WARN (IndexUtils.java:86) - Format version is not supported (resource BufferedChecksumIndexInput(SimpleFSIndexInput(path="/Users/gonchik.tsymzhitov/temp/lucene/META-INF/112/edge/segments_1"))): 0 (needs to be between 7 and 9). This version of Lucene only supports indexes created with release 6.0 and later.

org.apache.lucene.index.IndexFormatTooOldException: Format version is not supported (resource BufferedChecksumIndexInput(SimpleFSIndexInput(path="/Users/gonchik.tsymzhitov/temp/lucene/META-INF/112/edge/segments_1"))): 0 (needs to be between 7 and 9). This version of Lucene only supports indexes created with release 6.0 and later.
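
The warning reports a format version of 0 where 7 to 9 is expected. To check a copied index before picking a Luke release, you can read that number yourself. The sketch below assumes the codec header layout that, as far as I know, Lucene 4 and later write at the start of a `segments_N` file: a 4-byte big-endian magic number, a length-prefixed codec name, and a 4-byte big-endian version. The function names `read_vint` and `read_codec_header` are mine, and the layout is an assumption to verify against your Lucene version.

```python
import struct
from typing import BinaryIO, Tuple

# Assumption: magic number Lucene 4+ writes at the start of codec files.
CODEC_MAGIC = 0x3FD76C17


def read_vint(stream: BinaryIO) -> int:
    """Read a variable-length int: 7 data bits per byte, high bit set means more bytes follow."""
    result, shift = 0, 0
    while True:
        raw = stream.read(1)
        if not raw:
            raise ValueError("Unexpected end of file inside a VInt")
        byte = raw[0]
        result |= (byte & 0x7F) << shift
        if byte & 0x80 == 0:
            return result
        shift += 7


def read_codec_header(stream: BinaryIO) -> Tuple[str, int]:
    """Return (codec name, format version) read from the start of the stream."""
    head = stream.read(4)
    if len(head) < 4:
        raise ValueError("File is too short to contain a codec header")
    (magic,) = struct.unpack(">I", head)
    if magic != CODEC_MAGIC:
        raise ValueError("No codec header; the index may predate Lucene 4 or be corrupted")
    name_length = read_vint(stream)
    name = stream.read(name_length).decode("utf-8")
    tail = stream.read(4)
    if len(tail) < 4:
        raise ValueError("File ends before the version field")
    (version,) = struct.unpack(">i", tail)
    return name, version
```

Open the file in binary mode, for example `with open(path_to_segments_file, "rb") as f: print(read_codec_header(f))`. For the index in the log above, the version field would presumably be 0, the value the warning reports.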

Therefore, I recommend using a Luke 4.10.x build, such as the release linked above, and running luke.sh or luke.bat. Don’t forget to tick the options “Don’t open IndexReader (when opening corrupted index)” and “Force unlock, if locked”.

After that, you will see the statistics of your index.
Next, you can see the contentBody field, which holds the Confluence content, and in the table beside it you will see the top-ranking terms.

Then, on the Documents tab, if you double-click a top term, you will see which documents contain it, for example the top term “сво”.

If you click “Explain structure”, you can see the reasoning behind the rules of the Confluence search syntax: https://confluence.atlassian.com/doc/confluence-search-syntax-158720.html
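
To make clear what those two views show, here is a toy inverted index in plain Python. It is only an illustration of the idea, not Lucene's data structure or Luke's code: the sample documents, the whitespace tokenizer, and the function names are all made up, and I assume the top terms are ranked by the number of documents that contain them.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def build_inverted_index(docs: Dict[int, str]) -> Dict[str, Set[int]]:
    """Map each lowercased, whitespace-separated term to the ids of the documents containing it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return dict(index)


def top_terms(index: Dict[str, Set[int]], limit: int) -> List[Tuple[str, int]]:
    """Return the terms with the highest document frequency; ties are broken alphabetically."""
    ranked = sorted(index.items(), key=lambda item: (-len(item[1]), item[0]))
    return [(term, len(doc_ids)) for term, doc_ids in ranked[:limit]]


# Made-up sample documents standing in for Confluence pages.
docs = {
    1: "search index for confluence",
    2: "confluence search syntax",
    3: "jira index",
}
index = build_inverted_index(docs)
print(top_terms(index, 3))      # the "top terms" table
print(sorted(index["search"]))  # the documents behind one term
```

The first print corresponds to the top terms table, the second to double-clicking a term to list its documents.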

That’s all for today.

Conclusion:

I am happy to see Apache Luke shipped as a built-in tool in the Apache Lucene binary builds.
The tool helps me understand how search and index statistics work in Atlassian products and in many other Lucene-based full-text search projects.
I hope the Atlassian team will one day upgrade the Lucene libraries in Confluence: https://jira.atlassian.com/browse/CONFSERVER-57452 Feel free to vote if you want to see improvements in Confluence tokenising, stemming, search, and ranking.
One day, we may be able to use morphological analysis as an optional feature of the products, e.g. Russian morphology ( https://github.com/AKuznetsov/russianmorphology ).

Cheers,

Gonchik Tsymzhitov

Change is required. Atlassian engineer's blog
