Using crawl rules, we can allow or restrict specific paths/URLs from being crawled. This ensures that only specific, relevant content is crawled by the search engine. You can also specify alternate credentials for a crawl path. There's much more, so let's go into the details.
What can you do with crawl rules?
- Restrict content from being crawled: For example, if you want to crawl a site collection (http://servername/sites/ParentSite) but not its subsites (http://servername/sites/ParentSite/subsite), you can create a crawl rule to exclude the subsite path.
- Include content that would otherwise be excluded: For example, if you want to exclude the site collection (http://servername/sites/ParentSite) from the crawl but include a subsite (http://servername/sites/ParentSite/subsite), you can create a crawl rule to include it (both scenarios are sketched after this list).
- Use an alternate account to crawl the content: If a site requires credentials other than the default content access account, you can specify them in the rule.
- Rule order: If multiple rules match a URL/path, you can set the order in which the rules are evaluated.
Note: The order of the crawl rules is very important; the first rule in the order that matches a URL is the one applied to that URL.
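To make the two include/exclude scenarios above concrete, here is a minimal PowerShell sketch using the SharePoint 2013 Management Shell. It assumes a single Search Service Application in the farm and reuses the example URLs; the two scenarios are alternatives, not meant to be run together, so adjust the paths for your environment.

```powershell
# Run from the SharePoint 2013 Management Shell (assumes a single Search Service Application)
$ssa = Get-SPEnterpriseSearchServiceApplication

# Scenario 1: crawl the parent site collection but exclude its subsite
New-SPEnterpriseSearchCrawlRule -SearchApplication $ssa `
    -Path "http://servername/sites/ParentSite/subsite*" `
    -Type ExclusionRule

# Scenario 2: exclude the parent site collection but include one subsite.
# The inclusion rule must sit before the exclusion rule in the rule order,
# because the first matching rule wins (see the note above).
New-SPEnterpriseSearchCrawlRule -SearchApplication $ssa `
    -Path "http://servername/sites/ParentSite/subsite*" `
    -Type InclusionRule
New-SPEnterpriseSearchCrawlRule -SearchApplication $ssa `
    -Path "http://servername/sites/ParentSite/*" `
    -Type ExclusionRule
```

New rules are added at the end of the rule list, so in scenario 2 the inclusion rule is created first; you can also adjust the position later with the -Priority parameter or from the Order column in Central Administration.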
The step-by-step process of configuring crawl rules in the Search service application
- Go to Central Administration >> Application Management >> Manage service applications
- Click on your Search service application
- On the left-hand side of the page, click on Crawl Rules
- You will see the Manage Crawl Rules page. Click New Crawl Rule to create a crawl rule.
- Path: Specify the path the crawl rule applies to. You can use wildcard characters in the path.
- Use regular expression syntax for matching this rule: Select this option to match the path using regular expressions instead of wildcard characters.
- Crawl Configuration: Choose whether items matching the path are included in or excluded from the content index (a PowerShell sketch of these options follows this list).
- Exclude all items in this path: Select this option if you want to exclude all items under the specified path. You can refine the exclusion using the option below.
- Exclude complex URLs (URLs that contain question marks - ?): URLs that contain a question mark (?) are also excluded from the crawl.
- Include all items in this path: This includes all items matching the path specified above. You can refine the inclusion using the options below.
- Follow links on the URL without crawling the URL itself: If you select this option, the pages at the specified path are not crawled themselves, but the links they contain are followed and crawled.
- Crawl complex URLs (URLs that contain a question mark - ?): If you select this option, URLs that contain a question mark (?) are included in the crawl.
- Crawl SharePoint content as HTTP pages: When SharePoint content is crawled as HTTP pages, item permissions are not crawled or stored, so no permission check (security trimming) is applied when search results are displayed to the end user.
- Specify Authentication: Specify the authentication settings to be used for this crawl rule (see the second PowerShell sketch after this list).
Note: This section is disabled unless you selected Include all items in this path in the step above.
- Use the default content access account: The default content access account is used to crawl the specified path.
- Specify a different content access account: Specify an account and password if you want to crawl the content at the specified path with different credentials. You can tick Do not allow Basic Authentication if you want to avoid Basic authentication; the crawler will then use NTLM authentication.
- Specify client certificate: You can also authenticate with a client certificate, selected from the drop-down menu.
- Specify form credentials: If you want to use forms-based authentication, specify the URL of the page (form) used to enter credentials. When you add the form URL and click Enter Credentials, the page opens and prompts for credentials, which are then used to access the content at the specified path.
- Use cookie for crawling: If you want to use a cookie for crawling, select Obtain cookie from URL. You can instead select Specify cookie for crawling to import a cookie from the local file system. You can also specify error pages in the text box.
- Anonymous access: Use this option to crawl the resources at the specified path anonymously.
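Coming back to the Crawl Configuration options, here is a minimal PowerShell sketch of the same settings from the SharePoint 2013 Management Shell. The mapping of parameters to checkboxes (-FollowComplexUrls, -SuppressIndexing, -CrawlAsHttp) is my reading of the cmdlet, and the path is the example URL from earlier; verify both in your environment.

```powershell
# Create an inclusion rule and set its crawl configuration options.
# Splatting keeps each option readable and commentable.
$ssa = Get-SPEnterpriseSearchServiceApplication

$ruleOptions = @{
    SearchApplication = $ssa
    Path              = "http://servername/sites/ParentSite/*"
    Type              = "InclusionRule"
    FollowComplexUrls = $true    # "Crawl complex URLs (URLs that contain a question mark)"
    SuppressIndexing  = $false   # $true = "Follow links on the URL without crawling the URL itself"
    CrawlAsHttp       = $false   # $true = "Crawl SharePoint content as HTTP pages"
}
New-SPEnterpriseSearchCrawlRule @ruleOptions
```

For the regular expression option, the corresponding parameter is -IsAdvancedRegularExpression $true.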
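For the authentication options, here is a sketch of a rule that crawls a path with a different content access account. DOMAIN\CrawlUser and the SecureSite path are placeholders, and the -AuthenticationType values (NTLMAccountRuleAccess, AnonymousAccess, and so on) come from the CrawlRuleAuthenticationType enumeration.

```powershell
# Include a path and crawl it with a different (NTLM) content access account.
$ssa = Get-SPEnterpriseSearchServiceApplication
$password = Read-Host "Password for DOMAIN\CrawlUser" -AsSecureString   # -AccountPassword expects a SecureString

New-SPEnterpriseSearchCrawlRule -SearchApplication $ssa `
    -Path "http://servername/sites/SecureSite/*" `
    -Type InclusionRule `
    -AuthenticationType NTLMAccountRuleAccess `
    -AccountName "DOMAIN\CrawlUser" `
    -AccountPassword $password

# For anonymous access, use -AuthenticationType AnonymousAccess instead of the account parameters.
```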
Test a Crawl rule on a URL
- From Central Administration, go to the Manage Crawl Rules page.
- Enter the URL and click the Test button to find the rule that matches that URL.
- As shown in the screenshot above, one rule matches the URL, and the matching rule is marked with an asterisk (*) in front of it.
- You can also change the order of the rules directly from the Order column (or list them from PowerShell, as sketched below).
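If you prefer to review the rules and their evaluation order from the command line, here is a minimal sketch; it assumes the returned crawl rule objects expose Priority, Type, and Path properties.

```powershell
# List all crawl rules for the Search Service Application in evaluation order
$ssa = Get-SPEnterpriseSearchServiceApplication
Get-SPEnterpriseSearchCrawlRule -SearchApplication $ssa |
    Sort-Object Priority |
    Format-Table Priority, Type, Path -AutoSize
```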
Conclusion
We went through the step-by-step process of creating a crawl rule and testing a site URL against the crawl rules. This process was tested in a SharePoint 2013 on-premises environment.