I am developing and migrating a new document store for a client that is moving files from SharePoint On-Premises (2010) to Office 365. As part of the new solution we plan to be heavy users of search, so it is not surprising that I have been testing search at every step of the way. I was bogged down for three weeks on a single issue: search was not finding my files. I met with Microsoft Support almost every day and they were stymied, although admittedly this was over the Christmas and New Year's holidays.
I am writing this post to explain what I found, in the hope that I can save you some time. My work was focused on getting search to find files. Interestingly enough, my friends and colleagues Marc Anderson and Julie Turner were working on a similar issue, and they helped me figure this one out. In their case they were focused on lists and were using REST calls, while I was using the SharePoint Search Center. Marc has written his findings up and they can be found here.
Before I explain the solution, let me provide a summary of the issue.
The Issue
I have a set of old resumes that I use for testing search. They have some unique things in them that I can search for, and I have doctored them up over the years so that they are particularly good files for search testing. In particular, some of them contain the value of pi (3.1415926) and others contain the value of e (2.718). Some files have also been named Resume 3.1415926 and Resume 2.718.
Now when I search for "3.1415926," I expect to find all of the various versions of my resumes that have the number embedded in them or as part of the file name. When I searched for these files using the search center, I always got back a single result. When I searched for them using the local library search, I got back all the files.
As you can see in the pictures below, the screenshot on the left shows the results when I search for 3.1415926 in the local library: I find all four files. In the search center, I find only a single file.
I tested this same scenario on 4 completely independent Office 365 tenants, and Microsoft Support uploaded the files to their own test environment. In all cases we got exactly the same results.
Now to make things more interesting, when I created completely new files and embedded 3.1415926 in them, search found all variations of these files.
The Solution
After weeks of investigation, we found out what is going on. The Microsoft Support staff that I worked with had no idea about this. The search engine has a concept of "duplicate results," but I was unable to find any documentation on how it decides which results are duplicates.
If you look at the Search Results web part and click on Change Query, you will be able to set options on duplicates.
On the dialog box that pops up, click on the Settings tab. The last choice on the page is "Remove Duplicates." By default, all SharePoint search results are set to "Remove duplicates." This is the root cause of our problem. It is a mystery what Microsoft's search engine considers to be duplicate files. What has become clear to me is that files that started their lives in older formats (Word 2003), whether they have remained in those formats or have been upgraded, are all being treated by the search pipeline as duplicates.
When I change the Remove Duplicates option to "Don't remove duplicates," you can see that I now get all the results.
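For anyone querying through the REST API rather than the Search Center, the corresponding switch is the trimduplicates query parameter. Here is a minimal sketch in Python that only builds the request URL. The tenant URL and the function name build_search_url are hypothetical, and authentication and the actual HTTP call are left out.

```python
from urllib.parse import quote


def build_search_url(site_url, query_text, trim_duplicates=False):
    """Build a SharePoint search REST URL with duplicate trimming set explicitly.

    site_url is a placeholder for your own tenant or site URL.
    """
    base = site_url.rstrip("/")
    # OData-style string literals escape a single quote by doubling it.
    escaped = quote(query_text.replace("'", "''"), safe="")
    flag = "true" if trim_duplicates else "false"
    return f"{base}/_api/search/query?querytext='{escaped}'&trimduplicates={flag}"


if __name__ == "__main__":
    # Hypothetical tenant; replace with your own.
    print(build_search_url("https://contoso.sharepoint.com", "3.1415926"))
```

Passing trim_duplicates=True would reproduce the default behavior of the Search Results web part described above.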
The most interesting point in all of this for me is that Microsoft Support did not recognize this as a setting and instead treated it as a bug.
I've sent a link to your post to some people on the search team at Microsoft. Maybe they can tighten up that knowledge base.
M.
Posted by: Sympmarc | January 11, 2017 at 06:09 AM
Wow! Really interesting find. Somehow I've always missed this option in the query too. Thanks for posting!
Posted by: Oliver Bartholdson | January 11, 2017 at 06:29 AM
I found this description of the process:
During content processing, for every item being processed, FAST Search for SharePoint will obtain the value of title and the first 1024 bytes of body for this item, and use it to compute a numerical checksum that will be used as a document signature. This checksum is stored in the property documentsignature for every item processed.
During query time, whenever “Remove Duplicate Results” is enabled, the Search Center tells FAST Search for SharePoint to collapse results using the documentsignature property, effectively eliminating any duplicates for items that have the same title+first-1024-bytes-of-body.
When a user clicks on the “Duplicates (n)” link next to an item that has duplicates, another query is submitted to FAST Search for SharePoint, passing as an additional parameter the value of the fcoid managed property for the item selected, which will be used to return all items that contain the same checksum (“the duplicates”).
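The description above can be illustrated with a rough sketch in Python. This is not the actual FAST implementation: the real checksum algorithm is not given in the description, so MD5 is used purely as a stand-in, and the function names and the dictionary layout of a document are made up for illustration.

```python
import hashlib


def document_signature(title, body, prefix_bytes=1024):
    """Stand-in signature: hash of the title plus the first 1024 bytes of the body."""
    prefix = body.encode("utf-8")[:prefix_bytes]
    # MD5 is an assumption; the actual checksum function is not documented here.
    return hashlib.md5(title.encode("utf-8") + b"\0" + prefix).hexdigest()


def collapse_duplicates(docs):
    """Keep only the first document seen for each signature."""
    first_by_signature = {}
    for doc in docs:
        sig = document_signature(doc["title"], doc["body"])
        first_by_signature.setdefault(sig, doc)
    return list(first_by_signature.values())


if __name__ == "__main__":
    # Four resumes with the same title and the same opening 1024 bytes,
    # differing only later in the body.
    header = "A" * 1024
    resumes = [
        {"title": "Resume", "body": header + f" version {i} 3.1415926"}
        for i in range(4)
    ]
    print(len(collapse_duplicates(resumes)))
```

Under these assumptions, documents that share a title and an identical opening kilobyte collapse into a single result no matter how much they differ further down, which would match the behavior seen with the old resumes.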
Posted by: Oliver Bartholdson | January 13, 2017 at 06:55 AM
Great post Marcel as always.
So it's a hangover from the FAST days. 1024 bytes seems a little low to base that checksum on though.
Posted by: Nic Betts | January 13, 2017 at 12:35 PM
Thanks for sharing this, good information! Keep up the great work, we look forward to reading more from you in the future!
Posted by: Venkat | May 17, 2017 at 02:39 AM