Technical Blog


  • Web Crawling
    Monday, February 06, 2009
    Posted by Asif Khan





Web Crawling
This article covers web crawling, the technique made widely known by Google. It shows a simple page fetcher built in ASP.NET using C#.

Points of Interest
The data received can be used in many ways, depending on the individual's approach. Any text, image reference, or other data can be extracted from the returned string.
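
As a rough illustration of that kind of extraction, the sketch below pulls the page title and the image URLs out of an HTML string with regular expressions. The class and method names (PageExtractor, ExtractTitle, ExtractImageSources) are hypothetical and not part of the original code, and regular expressions are only an approximation that works for simple, well-formed markup; a real HTML parser would be more robust.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of the original article's code.
public static class PageExtractor
{
    // Returns the trimmed text inside the first <title> element,
    // or null if no title element is found.
    public static string ExtractTitle(string html)
    {
        Match m = Regex.Match(html, @"<title[^>]*>(.*?)</title>",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
        return m.Success ? m.Groups[1].Value.Trim() : null;
    }

    // Returns the quoted src attribute value of every <img> tag,
    // in the order the tags appear in the document.
    public static List<string> ExtractImageSources(string html)
    {
        List<string> sources = new List<string>();
        MatchCollection matches = Regex.Matches(html,
            @"<img\b[^>]*?\bsrc\s*=\s*[""']([^""']+)[""']",
            RegexOptions.IgnoreCase);
        foreach (Match m in matches)
        {
            sources.Add(m.Groups[1].Value);
        }
        return sources;
    }
}
```

Either method can be called directly on the string produced by the crawling code, for example PageExtractor.ExtractImageSources(html). Unquoted attribute values are not matched by this pattern.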

Using the code
To crawl the web from ASP.NET, I used the following .NET Framework 2.0 namespaces:
  1. System.Net
  2. System.IO
  3. System.Text
System.Net is the backbone of the crawling code. Its WebClient class offers a range of web access methods, one of which is DownloadData(). This method takes the URL to be crawled as a parameter and returns the response body as a byte array. The code stores the result in a byte[] variable and decodes it with the UTF8Encoding class, whose GetString() method takes the byte array and returns the entire page for the requested URL as a string. Note that this assumes the page is served as UTF-8; pages in other encodings may come back with garbled characters.

The crawling code is shown below:

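    using System;
    using System.Net;
    using System.Text;

    // Downloads the page at the given URL and returns its HTML as a string,
    // or null if the request fails. Meant to live in an ASP.NET page class,
    // because errors are written out through Response.
    public string crawlData(string url)
    {
        string myString = null;
        UTF8Encoding utf8 = new UTF8Encoding();
        WebClient objwc = new WebClient();
        try
        {
            // Fetch the raw response body as bytes.
            byte[] aReqHTML = objwc.DownloadData(url);
            // Decode the bytes as UTF-8 text.
            myString = utf8.GetString(aReqHTML);
        }
        catch (Exception ex)
        {
            // HTML-encode the details so error text cannot inject markup.
            // ex.Source may be null, so it is not dereferenced directly.
            Response.Write("<b>Message :</b>" + Server.HtmlEncode(ex.Message) +
                "<br><b>Source</b> :" + Server.HtmlEncode(ex.Source));
        }
        finally
        {
            // Release the WebClient whether or not the download succeeded.
            objwc.Dispose();
        }
        return myString;
    }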
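
The decoding step above always treats the downloaded bytes as UTF-8. As a hedged sketch of one alternative, the helper below picks the encoding from a Content-Type header value and falls back to UTF-8 when no usable charset is present. The names PageDecoder, EncodingFromContentType and Decode are hypothetical, and the sketch ignores charsets declared only inside the HTML itself.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration; not part of the original article's code.
public static class PageDecoder
{
    // Picks an Encoding from a Content-Type header value such as
    // "text/html; charset=ISO-8859-1". Falls back to UTF-8 when the
    // header is missing, has no charset, or names an unknown charset.
    public static Encoding EncodingFromContentType(string contentType)
    {
        if (!string.IsNullOrEmpty(contentType))
        {
            foreach (string part in contentType.Split(';'))
            {
                string p = part.Trim();
                if (p.StartsWith("charset=", StringComparison.OrdinalIgnoreCase))
                {
                    string name = p.Substring("charset=".Length).Trim('"', '\'', ' ');
                    try
                    {
                        return Encoding.GetEncoding(name);
                    }
                    catch (ArgumentException)
                    {
                        // Unknown or unsupported charset name: use the fallback.
                    }
                }
            }
        }
        return Encoding.UTF8;
    }

    // Decodes a response body using the charset named in the header value.
    public static string Decode(byte[] body, string contentType)
    {
        return EncodingFromContentType(contentType).GetString(body);
    }
}
```

With WebClient, the header value can be read after the download via objwc.ResponseHeaders["Content-Type"] and passed to Decode together with the byte array returned by DownloadData().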