Monday, May 5, 2008

Search Crawler Custom Workflow

In this post, I will create a custom workflow that crawls an external web URL.
The scenario
  • Users need to be able to search external web sites by adding a URL to a list.
  • When a user adds an item to the list, the workflow starts and crawls the URL so that the link becomes searchable.
What the workflow will do:

  • The workflow will first create a content source of type Web Sites, with the URL as the start address.
  • Then the workflow will restrict the crawl to the first page of the URL by setting both MaxPageEnumerationDepth and MaxSiteEnumerationDepth to 0.
  • Finally, it will start a full crawl of the new content source.

Steps:
1. In VS2008, create a project called CustomCrawlerWF using the SharePoint Server Sequential Workflow template. The feature.xml and workflow.xml files are created for you by this template.
2. Add references to Microsoft.SharePoint.dll, Microsoft.Office.Server.dll, and Microsoft.Office.Server.Search.dll.
3. Drag a CodeActivity below onWorkflowActivated1.
4. Double-click codeActivity1 and implement the codeActivity1_ExecuteCode handler:

// Requires: using Microsoft.SharePoint;
//           using Microsoft.Office.Server.Search.Administration;
private void codeActivity1_ExecuteCode(object sender, EventArgs e)
{
    //****************** Step 1: Get the site's search context ******************
    string strURL = @"http://sitename/";
    SearchContext context;
    using (SPSite site = new SPSite(strURL))
    {
        context = SearchContext.GetContext(site);
    }

    //****************** Step 2: Get the value of the URL column of the list item ******************
    // A SharePoint URL field is stored as "url, description", so strip the
    // description if one is present. (Calling Substring with IndexOf's -1
    // result would throw when there is no comma, so guard first.)
    string url = workflowProperties.Item["URL"].ToString();
    if (url.Contains(","))
    {
        url = url.Substring(0, url.IndexOf(','));
    }

    //****************** Step 3: Get the site's content sources collection ******************
    Content sspContent = new Content(context);
    ContentSourceCollection sspContentSources = sspContent.ContentSources;
    if (sspContentSources.Exists(url))
    {
        // A content source with that name already exists.
        return;
    }

    //****************** Step 4: Create a new content source, set the start address, and start a full crawl ******
    WebContentSource webCS = (WebContentSource)sspContentSources.Create(typeof(WebContentSource), url);
    webCS.StartAddresses.Add(new Uri(url));
    webCS.MaxPageEnumerationDepth = 0; // crawl only the start page
    webCS.MaxSiteEnumerationDepth = 0; // do not follow site hops
    webCS.Update();
    webCS.StartFullCrawl();

    //****************** Step 5: Put the content source into the Newsletters search scope ******
    // Reduce the URL to scheme + host name.
    if (url.StartsWith("http://"))
    {
        url = url.Substring(7);
        if (url.Contains("/"))
        {
            url = url.Substring(0, url.IndexOf('/'));
        }
        url = "http://" + url;
    }
    else if (url.StartsWith("https://"))
    {
        url = url.Substring(8);
        if (url.Contains("/"))
        {
            url = url.Substring(0, url.IndexOf('/'));
        }
        url = "https://" + url;
    }

    Scopes scopes = new Scopes(context);
    Scope newsletters = scopes.GetSharedScope("Newsletters");
    newsletters.Rules.CreateUrlRule(ScopeRuleFilterBehavior.Include, UrlScopeRuleType.HostName, url);
    scopes.StartCompilation();
}
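The string slicing in Step 5 can also be done with System.Uri, which handles scheme, port, and path parsing for you. A minimal sketch using only plain .NET types (no SharePoint dependencies); the example URLs are hypothetical:

```csharp
using System;

class UriDemo
{
    // Returns scheme + authority (host, plus port if non-default),
    // e.g. "http://example.com" -- equivalent to the manual slicing above.
    static string GetAuthority(string url)
    {
        return new Uri(url).GetLeftPart(UriPartial.Authority);
    }

    static void Main()
    {
        Console.WriteLine(GetAuthority("http://example.com/news/2008/may.html")); // http://example.com
        Console.WriteLine(GetAuthority("https://example.com"));                   // https://example.com
    }
}
```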


5. Choose Deploy from the Build menu in VS2008.
6. If you get an "Access is denied" error (described in the KB article "Error when you try to edit the content source schedule in Microsoft Office SharePoint Server 2007: 'Access is denied'"), here is the solution.
7. Go to your list and associate the workflow with the list. Set the workflow to start when a new list item is created.
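The association in step 7 can also be scripted instead of done through the UI. A hedged sketch using SPWorkflowAssociation; the site URL and the list, task list, and history list names are assumptions for illustration, and this needs to run on a machine in the farm:

```csharp
using System;
using Microsoft.SharePoint;
using Microsoft.SharePoint.Workflow;

class AssociateWorkflow
{
    static void Main()
    {
        // Site and list names below are assumptions -- adjust to your environment.
        using (SPSite site = new SPSite("http://sitename/"))
        using (SPWeb web = site.OpenWeb())
        {
            SPList list = web.Lists["External Links"];
            SPWorkflowTemplate template =
                web.WorkflowTemplates.GetTemplateByName("CustomCrawlerWF", web.Locale);

            SPWorkflowAssociation assoc = SPWorkflowAssociation.CreateListAssociation(
                template, "Crawl external URL",
                web.Lists["Tasks"], web.Lists["Workflow History"]);

            // Start the workflow when a new list item is created.
            assoc.AutoStartCreate = true;
            list.WorkflowAssociations.Add(assoc);
        }
    }
}
```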
8. If deploying to production, you could instead:
  • Copy feature.xml and workflow.xml to a CustomCrawlerWF folder under the 12 hive FEATURES folder.
  • Install the DLL into the GAC.
  • From the 12 hive bin folder, run:
    • stsadm -o installfeature -n CustomCrawlerWF
    • stsadm -o activatefeature ...
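The stsadm commands in step 8 written out as a sequence; the -url value in the activatefeature line is a hypothetical site URL, and activatefeature takes additional parameters depending on feature scope:

```shell
:: Run from the 12 hive bin folder, e.g.
:: C:\Program Files\Common Files\Microsoft Shared\web server extensions\12\BIN
stsadm -o installfeature -n CustomCrawlerWF
stsadm -o activatefeature -n CustomCrawlerWF -url http://sitename/
```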

9. Note: if you edit or delete any content sources, the workflow will fail the next time an item is added to a list that uses it. Debugging shows the error that the content source has been modified; to work around this, run an iisreset before adding the list item.

Reference:
How to: Programmatically Manage the Crawl of a Content Source
SharePoint 2007 Workflows - Writing an Ultra Basic WF
