Choosing the Best Stemmer for Your Solr Collection

Stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form, generally a written word form. Solr provides several filters that apply stemming, and each stemmer differs in the number of scenarios it covers. For one of my projects, we created a matrix to help make the decision. It may help you decide too.


Nested JSON Objects with Solr

We use Solr to store different types of structured data. Solr works well and feels intuitive as long as the structured entity has only properties of basic types such as string, number, or date. But the moment we want to index an entity with relations (which is quite common), the intuitiveness of the response has to be compromised. Different teams have different strategies to deal with this. We tried several approaches and settled on a custom response writer along with a naming convention in the schema. Note that this won't help those who have to work with a dynamic schema or in schemaless mode.


Git best practices #BackToBasics

Git has taken source control systems into a new era. I started my career with Subversion but eventually moved to Git. The following are some of the practices that have helped me manage my code well. Most of these are basic guidelines, but while listing them here I realized that I am skipping some of them myself. It's always good to have a relook at the basics :).

These basic Git practices are like TDD and Agile: everyone knows them but fails to practice them, for various reasons which I will leave to the reader😉.

Branching practices

  • Pick a branching model which makes sure that
    • Developers can easily switch between different features. You shouldn’t have to comment out code for the feature you are working on just because you need to work on something else.
    • There is one place where features developed by teammates can be merged and tested
    • There is one branch which always contains the code running in production, so that production issues can be fixed immediately
    • Multiple teams with different deadlines can collaborate
  • Create a branch for every story which might take more than half a day. Instead of asking “why a new branch?”, let’s ask “why not a new branch?”. If there is no strong reason against it, go ahead and create a branch.
  • Clean up (delete) the branch once you are done with that feature.
  • Name the branch with the issue number if you are using a backlog tool. This helps team members quickly see which branch is for what.

Git commits

  • Try not to use “git add .”. Add one file at a time after checking its diff
  • Regular commits make sure you have fewer files to check and commit each time
  • Make sure your commit message contains the number of the issue you are working on.
  • Commit ONLY related files together
  • Keep your .gitignore file updated with all files which need not be committed. When we type “git status”, we shouldn’t see any file which is never meant to be committed. Sometimes this takes some work in the code to make sure a local profile is available.
  • Tag the commits that correspond to releases and major milestones

Collaboration

  • Git is for source code, not for binary files. Do not check in artifacts. Make use of a platform-specific distribution strategy.
  • If you need a minor change in another team’s repository, make the change and send a pull request instead of sending a mail with the lines of code to update
  • Pull changes from the integration branch at least every day or half day

Repository level

  • If a project is started after a successful PoC, create a new repository for the project instead of reusing the PoC repository
  • Make sure all repositories live under the relevant organization, not under personal accounts
  • Sometimes, based on context and team structure, you might want different repositories for different components, especially when some of the components are shared.
  • It’s always good to give enough thought to naming a repository well. The name of the repo should give a quick glimpse of what it contains. Avoid acronyms unless they are well known. There is no need to include the portfolio name, as the organization already reflects it.
  • A README.md file is a MUST, with
    • Some description about project
    • URLs for the application
    • Important contacts
    • Servers used by code in this repo
    • Documentation link
  • All code (code that goes into artifacts, properties, config, etc.) must be in source control.

Fix: Issue with getting slide title using Apache POI

Indexing office files is a common case when developing search applications. In my case, I needed to index the slides of a presentation, with title and content indexed separately because we need to give the title a higher boost. But when extracting slide titles using the XSLFSlide method getTitle(), the title was not returned for many slides.


Solr: Same configuration files for master/slave for every environment

A Solr core needs many configuration files. They look different for master and slave. Also, some files such as data-config.xml differ between environments when you have data sources to index from. With these permutations, you might end up with a large number of config file sets. Here I explain how I managed to maintain a single set for all servers and all environments.

TIP: Having issues loading classpath resources in Java?

I spent almost a day trying to understand why my unit test could load a resource while the same lookup failed once I deployed the jar. In the end I came across something very basic: Some.class.getResource behaves differently from Some.class.getClassLoader().getResource. There is no simple answer like “always use this one”. It depends on which class loader loaded that particular class. So if you are facing such an issue, try checking both options.
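
To make the difference concrete, here is a minimal sketch for trying the lookups side by side. One documented difference is in how the name is resolved: Class.getResource treats a name without a leading "/" as relative to the class's package, while ClassLoader.getResource always resolves from the classpath root and expects no leading "/". The class name ResourceLookupDemo, the helper toClassLoaderName and the resource name app.properties are assumptions made up for this illustration, not taken from the original project.

```java
import java.net.URL;

public class ResourceLookupDemo {

    // Mirrors the documented name resolution of Class.getResource:
    // a leading "/" means "from the classpath root", otherwise the
    // class's package path is prepended. (Hypothetical helper; ignores
    // the array-class special case. Requires Java 9+ for getPackageName.)
    static String toClassLoaderName(Class<?> owner, String name) {
        if (name.startsWith("/")) {
            return name.substring(1);
        }
        String pkg = owner.getPackageName();
        return pkg.isEmpty() ? name : pkg.replace('.', '/') + "/" + name;
    }

    public static void main(String[] args) {
        Class<?> owner = ResourceLookupDemo.class;
        String name = "app.properties"; // hypothetical resource name

        // 1. Relative to the package of 'owner'
        URL viaClass = owner.getResource(name);
        // 2. From the classpath root, through the class
        URL viaClassRoot = owner.getResource("/" + name);
        // 3. From the classpath root, through the loader that loaded 'owner'
        //    (getClassLoader() can be null for bootstrap classes)
        ClassLoader loader = owner.getClassLoader();
        URL viaLoader = (loader != null)
                ? loader.getResource(name)
                : ClassLoader.getSystemResource(name);
        // 4. Through the thread context class loader, which containers
        //    often set to something other than the loader in step 3
        ClassLoader context = Thread.currentThread().getContextClassLoader();
        URL viaContext = (context != null) ? context.getResource(name) : null;

        System.out.println("Class.getResource(name):        " + viaClass);
        System.out.println("Class.getResource(\"/\" + name):  " + viaClassRoot);
        System.out.println("ClassLoader.getResource(name):  " + viaLoader);
        System.out.println("Context loader getResource:     " + viaContext);
        System.out.println("Name the class lookup resolves: "
                + toClassLoaderName(owner, name));
    }
}
```

Running this once from the unit test and once from the deployed jar shows which lookups return null in each setup, which is usually enough to tell whether the problem is the resource name or the class loader.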