Choosing the Best Stemmer for Your Solr Collection

Stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base, or root form, generally a written word form. Solr applies stemming through token filters, and each stemmer differs in the number of scenarios it covers. For one of my projects we created a matrix to help make the decision, and it may help you decide too. In the matrix, Y means the stemmer reduces both words to the same form, so a search for one matches the other; N means it does not.

| Keywords | English Minimal | KStem | Snowball | Porter | Hunspell |
|---|---|---|---|---|---|
| “develop” vs “developer” | N | N | Y | Y | Y |
| “Design” vs “Designer” | N | N | Y | Y | N |
| “time” vs “timer” | N | N | N | N | Y |
| “Analyst” vs “Analysts” | Y | Y | Y | Y | Y |
| “analyst” vs “analysis” | N | N | N | N | N |
| “Experience” vs “Experienced” | N | N | N | N | Y |
| “tech” vs “technology” | N | N | N | N | N |
| “bio” vs “biology” | N | N | N (Y vs biological) | N | N |
| “pharma” vs “pharmaceutical” | N | N | N (Y vs pharmaceutics) | N | N |
| “french” vs “France” | N | N | N | N | N |
| “relate” vs “relation” | N | N | Y | Y | Y |
| “fail” vs “failure” | N | N | N (Y vs failed) | N | N |
| “talk” vs “talked” | N | Y | Y | Y | Y |
| “talk” vs “talking” | N | N (Eat, Eating Y) | Y | Y | Y |
| “separate” vs “separator” | N | N | Y | Y | N |
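If you want to build a similar matrix for your own vocabulary, the comparison can be sketched roughly as below. This is only an illustration: `minimal_toy` and `aggressive_toy` are hypothetical stand-ins written for this example, not Solr's stemmers, and they will not reproduce the results in the table above. In practice you would replace them with the output of your real Solr analysis chain.

```python
from typing import Callable, Dict, List, Tuple

Stemmer = Callable[[str], str]


def minimal_toy(word: str) -> str:
    """Hypothetical minimal stemmer: lowercases and strips only a plural 's'."""
    w = word.lower()
    if w.endswith("s") and not w.endswith("ss") and len(w) > 3:
        return w[:-1]
    return w


def aggressive_toy(word: str) -> str:
    """Hypothetical aggressive stemmer: also strips a few common suffixes."""
    w = minimal_toy(word)
    for suffix in ("ing", "ed", "er", "or"):
        # Keep at least three characters so very short words are left alone.
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w


def build_matrix(
    pairs: List[Tuple[str, str]], stemmers: Dict[str, Stemmer]
) -> Dict[Tuple[str, str], Dict[str, str]]:
    """Mark each pair 'Y' if a stemmer maps both words to the same stem, else 'N'."""
    matrix = {}
    for left, right in pairs:
        matrix[(left, right)] = {
            name: "Y" if stem(left) == stem(right) else "N"
            for name, stem in stemmers.items()
        }
    return matrix


PAIRS = [
    ("Analyst", "Analysts"),
    ("talk", "talked"),
    ("develop", "developer"),
    ("tech", "technology"),
]
STEMMERS = {"minimal_toy": minimal_toy, "aggressive_toy": aggressive_toy}

MATRIX = build_matrix(PAIRS, STEMMERS)
for (left, right), row in MATRIX.items():
    cells = "  ".join(f"{name}={flag}" for name, flag in row.items())
    print(f"{left} vs {right}: {cells}")
```

Keeping the stemmers behind a plain "word in, stem out" function makes it easy to add another column later without touching the comparison logic.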

Details about each stemming filter can be found in the Solr documentation.

Broadly, stemmers are categorized as aggressive or minimal. Aggressive stemmers cover more scenarios, while minimal stemmers cover fewer. There is no clear-cut rule for which one to use; it depends entirely on the kind of data and the context, so use the matrix above as a guide. Note that an aggressive stemmer might return some unrelated results. For example, consider “Organization” and “Organic”: aggressive stemmers reduce both to the same root form, so a search for “Organization” might return results containing the word “Organic”. You have to see what is needed in your context and which other filter queries can remove the junk.

Thanks, Luis, for suggesting this matrix, which helped us come up with the right stemmer.
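The “Organization” vs “Organic” problem, and how an extra filter can remove the junk, can be sketched in a few lines. Everything here is hypothetical: `toy_aggressive_stem` is a made-up suffix stripper that merely imitates an aggressive stemmer mapping both words to the same root, `DOCS` is invented sample data, and the `category` argument only plays the role that a filter query would play in Solr.

```python
from typing import Dict, List, Optional


def toy_aggressive_stem(word: str) -> str:
    """Hypothetical aggressive stemmer: strips 'ization' or 'ic' from long words."""
    w = word.lower()
    for suffix in ("ization", "ic"):
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w


# Invented sample documents, for illustration only.
DOCS: List[Dict[str, object]] = [
    {"id": 1, "category": "business", "title": "Organization chart template"},
    {"id": 2, "category": "food", "title": "Organic vegetables guide"},
]


def search(
    query: str, docs: List[Dict[str, object]], category: Optional[str] = None
) -> List[int]:
    """Return ids of docs whose stemmed title contains the stemmed query term.

    The optional category restriction stands in for a Solr filter query.
    """
    query_stem = toy_aggressive_stem(query)
    hits = []
    for doc in docs:
        title_stems = {toy_aggressive_stem(t) for t in str(doc["title"]).split()}
        if query_stem in title_stems and (
            category is None or doc["category"] == category
        ):
            hits.append(int(doc["id"]))
    return hits


print(search("Organization", DOCS))                       # both docs match
print(search("Organization", DOCS, category="business"))  # only the relevant doc
```

Without the restriction the search returns both documents, because both titles share the stem “organ”; with it, only the business document remains.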


