I am developing and migrating a new document store for a client that is moving files from SharePoint On-Premises (2010) to Office 365. As part of the new solution we are planning on being heavy users of search, so it is not surprising that I have been testing search at every step of the way. I was bogged down for 3 weeks on one issue: search was not finding my files. Almost every day I met with Microsoft Support and they were stymied, although admittedly this was over the Christmas and New Year's holidays.
I am writing this post to explain what I found, in the hope that I can save you some time. My work was focused on search being able to find files. Interestingly enough, my friends and colleagues Marc Anderson and Julie Turner were working on a similar problem, and they helped me figure this one out. In their case they were focused on lists and were using REST calls, while I was using the SharePoint Search Center. Marc has written up his findings, and they can be found here.
Before I explain the solution, let me provide a summary of the issue.
I have a set of old resumes that I use for testing search. They have some unique things in them that I can search for. I have doctored them up over the years so that they are particularly good files to use for search testing. In particular, some of them contain the value of pi (3.1415926) and others contain the value of e (2.718). Some of the files have also been named Resume 3.1415926 and Resume 2.718.
Now when I search for "3.1415926," I expect to find all of the various versions of my resumes that have the number embedded in them or as part of the file name. When I performed a search for these files using the search center, I always got back a single result. If I searched for these files using the local library search, I got back all the files.
As you can see in the pictures below, the screenshot on the left shows the results when I search for 3.1415926 in the local library: I find all four files. In the search center, by contrast, I find only a single file.
I tested this same scenario on 4 completely independent Office 365 tenants, and Microsoft Support uploaded the files to their own test environment. In all cases we got exactly the same results.
Now to make things more interesting, if I create completely new files and embed 3.1415926 in them, search finds all variations of these files.
After weeks of investigation, we found out what was going on. The Microsoft Support staff that I worked with had no idea about this. The search engine has a concept of "duplicate results," but I was unable to find any documentation on how it decides which results are duplicates.
If you look at the Search Results web part and click on Change Query, you will be able to set options on duplicates.
On the dialog box that pops up, click on the Settings tab. The last choice on the page is "Remove Duplicates." By default, all SharePoint search results are set to "Remove duplicates." This is the root cause of our problem. It is a mystery what Microsoft's search engine considers to be duplicate files. What has become clear to me is that files that started their lives in older formats (Word 2003), whether they have remained in those formats or have since been upgraded, are all being mistaken for duplicates by the search pipeline.
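For anyone querying search through REST rather than the Search Center, the same switch is exposed on the SharePoint Search REST endpoint as the trimduplicates query parameter. Below is a minimal sketch in Python that only builds the query URL with duplicate trimming turned off. The function name build_search_url and the site URL are made up for illustration, the sketch assumes the search text contains no single quotes, and authentication and the actual HTTP request are left out.

```python
from urllib.parse import quote


def build_search_url(site_url: str, query_text: str, trim_duplicates: bool = False) -> str:
    """Build a SharePoint Search REST query URL with duplicate trimming set explicitly.

    Hypothetical helper for illustration only. It assumes query_text contains
    no single quotes, and it does not send the request or handle authentication.
    """
    # The REST endpoint expects the query text wrapped in single quotes.
    params = [
        f"querytext='{quote(query_text, safe='')}'",
        # trimduplicates=false asks search not to collapse results it considers duplicates.
        f"trimduplicates={'true' if trim_duplicates else 'false'}",
    ]
    return f"{site_url.rstrip('/')}/_api/search/query?" + "&".join(params)


# Hypothetical site URL; substitute your own tenant and site.
url = build_search_url("https://contoso.sharepoint.com/sites/docs", "3.1415926")
print(url)
```

Comparing the number of results returned with trim_duplicates=True against trim_duplicates=False for the same search text is a quick way to check whether duplicate trimming is what is hiding your files.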