Creating a Web Page Scraper in C#

Introduction: This tutorial will teach you how to make a web scraper in C# on the .NET Framework.

Theory: Here are the steps we will follow:
- Get the web page source
- Dissect the source
- Output the results

Getting the Source: First we need to get the web page source. Our target URL is the home page of sourcecodester.com. We create a basic HttpWebRequest to the site, receive the response, and read it into a string, which we return to the caller...
    // Requires: using System.IO; using System.Net;
    static string getSource()
    {
        HttpWebRequest req = (HttpWebRequest)WebRequest.Create("http://www.sourcecodester.com/");
        req.UserAgent = "curl"; // present ourselves like the curl command-line tool
        req.Method = "GET";
        // Dispose the response and the reader once the body has been read.
        using (HttpWebResponse res = (HttpWebResponse)req.GetResponse())
        using (StreamReader reader = new StreamReader(res.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }
Dissecting the Source: Now that we have the source, we want to dissect it. As a side note, here is what the Main function, from which we call everything, looks like...
    static void Main(string[] args)
    {
        string src = getSource();
    }
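Since the next step is to study the markup, it can help to write the fetched string to disk and open it in a text editor. The following is a minimal sketch of that idea, not part of the original program: the helper name SaveSource and the file name source.html are assumptions made for illustration, and File.WriteAllText is used here in place of an explicit file stream.

```csharp
using System;
using System.IO;

static class SaveSourceSketch
{
    // Hypothetical helper: writes the fetched markup to disk for inspection.
    public static void SaveSource(string src, string path)
    {
        File.WriteAllText(path, src);
    }

    static void Main()
    {
        // Stand-in for the string returned by getSource().
        string src = "<html><body>example</body></html>";

        // Assumed file name; any writable location works.
        string path = Path.Combine(Path.GetTempPath(), "source.html");

        SaveSource(src, path);
        Console.WriteLine("Saved " + src.Length + " characters to " + path);
    }
}
```

In the real program, the stand-in string would be replaced by the result of getSource().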
So first we want to look for patterns in the source. You can either save the web page from your browser and open the saved document in a text editor on your PC, or you can use a file stream to save the HTTP response from our program. Looking at the source, we can see that all the articles are surrounded by divs with the class of ''. About three classes into the div, we can see that the one I have selected is a 'node-book'; there are other types such as 'source-code', so we are going to use only the classes that appear on all the articles: "
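One way to collect the article containers into a list is sketched below. This is an illustration under assumptions, not the original tutorial's code: the class name "example-class" is a placeholder (the real one has to be read from the page source), the helper names ExtractDivs and FindMatchingClose are hypothetical, and the class check is a crude substring match on the opening tag. The sketch finds each matching opening div and then tracks nesting depth to locate its closing tag.

```csharp
using System;
using System.Collections.Generic;

static class ExtractSketch
{
    // Returns every <div ...>...</div> block whose opening tag mentions className.
    public static List<string> ExtractDivs(string src, string className)
    {
        var articles = new List<string>();
        int pos = 0;
        while (true)
        {
            int start = src.IndexOf("<div", pos, StringComparison.OrdinalIgnoreCase);
            if (start < 0) break;
            int tagEnd = src.IndexOf('>', start);
            if (tagEnd < 0) break;

            string openTag = src.Substring(start, tagEnd - start + 1);
            if (!openTag.Contains(className))
            {
                pos = tagEnd + 1; // not an article container, keep scanning
                continue;
            }

            int end = FindMatchingClose(src, tagEnd + 1);
            if (end < 0) break; // unbalanced markup, give up
            articles.Add(src.Substring(start, end - start));
            pos = end;
        }
        return articles;
    }

    // Walks forward from just after an opening <div>, tracking nesting depth.
    // Returns the index just past the matching </div>, or -1 if there is none.
    static int FindMatchingClose(string src, int from)
    {
        int depth = 1;
        int pos = from;
        while (depth > 0)
        {
            int nextOpen = src.IndexOf("<div", pos, StringComparison.OrdinalIgnoreCase);
            int nextClose = src.IndexOf("</div", pos, StringComparison.OrdinalIgnoreCase);
            if (nextClose < 0) return -1;

            if (nextOpen >= 0 && nextOpen < nextClose)
            {
                depth++;            // a nested div opens first
                pos = nextOpen + 4;
            }
            else
            {
                depth--;            // a div closes first
                int gt = src.IndexOf('>', nextClose);
                if (gt < 0) return -1;
                pos = gt + 1;
            }
        }
        return pos;
    }

    static void Main()
    {
        // Made-up markup standing in for the real page source.
        string src = "<div class=\"example-class\"><div>inner</div>one</div>"
                   + "<div class=\"other\">skip</div>"
                   + "<div class=\"example-class\">two</div>";
        List<string> articles = ExtractDivs(src, "example-class");
        Console.WriteLine(articles.Count + " containers found");
    }
}
```

String searching like this is fragile against unusual markup; a proper HTML parser would be the more robust choice for anything beyond a demonstration.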
Outputting the Results: All done. Now we can simply output the resulting containers...
    foreach (string s in articles)
    {
        Console.WriteLine(s);
    }
Of course, this was just a simple demonstration; we could then dissect the information further and extract the titles and other pieces of information from the divs. Finished!
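As a rough sketch of that further dissection, the snippet below pulls a title out of a single container. It rests on an assumption that may not hold for the real page: that the title is the text of the first link inside the div. The helper name ExtractTitle and the sample markup are made up for illustration.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

static class TitleSketch
{
    // Assumption: the title is the text of the first <a> element in the container.
    // Returns null when the container holds no link at all.
    public static string ExtractTitle(string container)
    {
        Match m = Regex.Match(container, @"<a\b[^>]*>(.*?)</a>",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
        if (!m.Success) return null;

        // Drop any tags nested inside the link, then decode entities such as &amp;.
        string inner = Regex.Replace(m.Groups[1].Value, "<[^>]+>", "");
        return WebUtility.HtmlDecode(inner).Trim();
    }

    static void Main()
    {
        // Made-up container standing in for one entry of the articles list.
        string sample = "<div class=\"example-class\"><h2><a href=\"/node/1\">Tom &amp; Jerry</a></h2></div>";
        Console.WriteLine(ExtractTitle(sample)); // prints "Tom & Jerry"
    }
}
```

Calling ExtractTitle on each string in the articles list would give one title per container, with null marking containers that do not match the assumed layout.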
