Creating a Web Page Scraper in C#

Submitted by Yorkiebar on Tuesday, July 8, 2014 - 05:16.

Introduction: This tutorial will teach you how to make a web scraper in C# on the .NET Framework.

Theory: Here are the steps we will follow:

1. Get the web page source
2. Dissect the source
3. Output the results

Getting the Source: First we need to get the web page source. Our target URL is the home page of sourcecodester.com. We create a basic HttpWebRequest to the site, receive the response, and read it into a string, which we return to the caller of the function...

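// At the top of the file:
using System.IO;
using System.Net;

static string getSource()
{
    HttpWebRequest req = (HttpWebRequest)WebRequest.Create("http://www.sourcecodester.com/");
    req.UserAgent = "curl"; // present ourselves as the curl command-line tool
    req.Method = "GET";
    // Dispose the response and the reader once the body has been read.
    using (HttpWebResponse res = (HttpWebResponse)req.GetResponse())
    using (StreamReader reader = new StreamReader(res.GetResponseStream()))
    {
        return reader.ReadToEnd();
    }
}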

Dissecting the Source: Now that we have the source, we want to dissect it. As a side note, here is what the Main function, from which we call everything, looks like...

static void Main(string[] args)
{
    string src = getSource();
}
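To inspect the source in a text editor, the string can be written to disk first. The following is a minimal sketch, not part of the original program: the `SourceDump` class, its `Save` method, and the `source.html` file name are all hypothetical names chosen for illustration, and the sample markup stands in for the value returned by getSource().

```csharp
using System;
using System.IO;

static class SourceDump
{
    // Hypothetical helper: writes the page source to the given path
    // and returns the absolute path of the file that was written.
    public static string Save(string source, string path)
    {
        File.WriteAllText(path, source);
        return Path.GetFullPath(path);
    }
}

static class Program
{
    static void Main()
    {
        // Placeholder markup; in the tutorial this would be getSource().
        string src = "<html><body>sample</body></html>";
        Console.WriteLine("Saved to " + SourceDump.Save(src, "source.html"));
    }
}
```

The saved file can then be opened in any text editor to look for repeating markup.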

So first we want to look for patterns in the source. You can either save the web page from your browser and open the saved document in a text editor on your PC, or you can use a file stream to save the HTTP response from our program. Looking at the source, we can see that all the articles are surrounded by divs with the class of '

'. About three classes into the div, we can see that the one I have selected is a 'node-book'. There are other types, such as 'source-code', so we are going to use only the classes that are shared by all the articles; "
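One way the extraction step might look is sketched below. This is an illustrative assumption rather than the original code: the class name `article-box` is a hypothetical placeholder for the shared class, `ArticleExtractor` and `ExtractDivsByClass` are made-up names, and the inline sample markup replaces the live page. The sketch finds each opening div that carries the class, then counts nested div tags to locate the matching close.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class ArticleExtractor
{
    // Returns the full markup of every <div> whose class list contains className.
    public static List<string> ExtractDivsByClass(string html, string className)
    {
        var results = new List<string>();
        // Opening <div> tags with a double-quoted class attribute; group 1 is its value.
        var open = new Regex("<div\\b[^>]*\\bclass\\s*=\\s*\"([^\"]*)\"[^>]*>",
                             RegexOptions.IgnoreCase);
        // Any opening or closing div tag; group 1 is "/" for a closing tag.
        var anyDiv = new Regex("<(/?)div\\b[^>]*>", RegexOptions.IgnoreCase);
        char[] whitespace = { ' ', '\t', '\r', '\n' };

        foreach (Match start in open.Matches(html))
        {
            string[] classes = start.Groups[1].Value.Split(
                whitespace, StringSplitOptions.RemoveEmptyEntries);
            if (Array.IndexOf(classes, className) < 0) continue;

            // Walk forward from the opening tag, tracking nesting depth,
            // until the div that was opened at start.Index is closed.
            int depth = 0;
            for (Match m = anyDiv.Match(html, start.Index); m.Success; m = m.NextMatch())
            {
                depth += m.Groups[1].Value == "/" ? -1 : 1;
                if (depth == 0)
                {
                    results.Add(html.Substring(start.Index, m.Index + m.Length - start.Index));
                    break;
                }
            }
        }
        return results;
    }
}

static class Program
{
    static void Main()
    {
        // Sample markup; "article-box" is a hypothetical class name.
        string src = "<div class=\"article-box node-book\"><div>Book</div></div>" +
                     "<div class=\"sidebar\">Menu</div>" +
                     "<div class=\"article-box source-code\">Code</div>";
        List<string> articles = ArticleExtractor.ExtractDivsByClass(src, "article-box");
        foreach (string s in articles) Console.WriteLine(s);
    }
}
```

Regular expressions are enough for a demonstration like this, but they do not handle malformed markup or single-quoted attributes; a dedicated HTML parser would be more robust on real pages.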

Outputting the Results: All done. Now we can simply output the resulting containers...

foreach (string s in articles)
{
    Console.WriteLine(s);
}

Of course, this was just a simple demonstration; we could then dissect the information further and extract the titles and other pieces of information from the divs. Finished!
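As a rough illustration of that further step, the sketch below pulls a title out of one article container. It assumes, without confirmation from the actual page, that the title is the text of the first link inside the div; `TitleExtractor` and `FirstLinkText` are hypothetical names, and the sample fragment is invented.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

static class TitleExtractor
{
    // Returns the visible text of the first <a> element in the fragment,
    // or null when the fragment contains no link.
    public static string FirstLinkText(string fragment)
    {
        Match m = Regex.Match(fragment, "<a\\b[^>]*>(.*?)</a>",
                              RegexOptions.IgnoreCase | RegexOptions.Singleline);
        if (!m.Success) return null;
        // Remove any nested tags, then decode entities such as &amp;.
        string text = Regex.Replace(m.Groups[1].Value, "<[^>]+>", "");
        return WebUtility.HtmlDecode(text).Trim();
    }
}

static class Program
{
    static void Main()
    {
        // Invented fragment standing in for one entry of the articles list.
        string article = "<div><h2><a href=\"/example\">Tom &amp; <b>Jerry</b></a></h2></div>";
        Console.WriteLine(TitleExtractor.FirstLinkText(article));
    }
}
```

Calling `FirstLinkText` on each string in articles would give one title per container, provided the assumption about the link holds for the page being scraped.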
