Introduction:
In this tutorial I will show you how to create a web page scraper in Visual Basic. A scraper downloads the source of a web page so that information can be gathered from it automatically.
Steps of Creation:
Step 1:
First, create a form with a Button (set the name to scrapeButton), a Text Box (set the name to linkURL), a Rich Text Box (set the name to srcBox) and a Web Browser (set the name to srcBrowser). Clicking the button grabs the source of the page whose address is in the linkURL Text Box, puts that source into srcBox, and then has srcBrowser display the srcBox text.
Step 2:
Before we start coding we need to import two namespaces: one for connecting to a website and another for reading the source:
Imports System.Net
Imports System.IO
Step 3:
Our code goes inside the button's Click event handler:
Private Sub scrapeButton_Click(sender As Object, e As EventArgs) Handles scrapeButton.Click
    ' The code from the following steps goes in here
End Sub
Step 4:
The first part of the code makes sure that something has been entered and puts it in the expected format, adding the http:// scheme and the www. prefix where they are missing:
' Only continue when the user has entered something
If Not String.IsNullOrEmpty(linkURL.Text) Then
    ' Lower-case the input so the prefix checks below match (note: this also lower-cases the path)
    linkURL.Text = linkURL.Text.ToLower()
    If linkURL.Text.StartsWith("https://") OrElse linkURL.Text.StartsWith("http://") Then
        ' A scheme is present; insert "www." after it if it is missing
        If Not linkURL.Text.StartsWith("https://www.") AndAlso Not linkURL.Text.StartsWith("http://www.") Then
            If linkURL.Text.StartsWith("http://") Then
                linkURL.Text = "http://www." & linkURL.Text.Substring(7)
            Else
                linkURL.Text = "https://www." & linkURL.Text.Substring(8)
            End If
        End If
    ElseIf linkURL.Text.StartsWith("www.") Then
        ' No scheme, but "www." is present; default to http
        linkURL.Text = "http://" & linkURL.Text
    Else
        ' Neither a scheme nor "www."; add both
        linkURL.Text = "http://www." & linkURL.Text
    End If
End If
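The checks above only adjust the prefix of the text; they do not confirm that it is a well-formed address. As a rough sketch of an alternative, the .NET Uri class can do the parsing for us. UrlHelper, TryNormalizeUrl and rawUrl are names I made up for this example, and defaulting to http:// when no scheme is typed is an assumption:

```vb
Imports System

Module UrlHelper
    ' Hypothetical helper: returns True and the parsed Uri when the text is a usable http(s) address
    Function TryNormalizeUrl(ByVal rawUrl As String, ByRef result As Uri) As Boolean
        result = Nothing
        If String.IsNullOrWhiteSpace(rawUrl) Then Return False

        Dim candidate As String = rawUrl.Trim()
        ' Assumption: fall back to http when the user typed no scheme at all
        If Not candidate.Contains("://") Then candidate = "http://" & candidate

        ' Let the framework decide whether the text is a valid absolute URI
        Dim parsed As Uri = Nothing
        If Not Uri.TryCreate(candidate, UriKind.Absolute, parsed) Then Return False

        ' Only accept web addresses, not ftp://, file:// and so on
        If parsed.Scheme <> Uri.UriSchemeHttp AndAlso parsed.Scheme <> Uri.UriSchemeHttps Then Return False

        result = parsed
        Return True
    End Function
End Module
```

With this helper, the click handler could call TryNormalizeUrl(linkURL.Text, parsed) and show a message when it returns False instead of sending a request. Unlike the code above, it does not force a www. prefix or change the case of the path.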
Step 5:
The next part of the code is the main part of this tutorial and deals with getting the source of a web page. First we create a request for the entered URL:
Dim req As HttpWebRequest = DirectCast(WebRequest.Create(linkURL.Text), HttpWebRequest)
Step 6:
Calling GetResponse sends the request and returns the response, which we put into an HttpWebResponse variable:
Dim res As HttpWebResponse = DirectCast(req.GetResponse(), HttpWebResponse)
Step 7:
Next, we turn the response into text by reading its stream with a StreamReader from the System.IO namespace we imported:
Dim src As String = New StreamReader(res.GetResponseStream()).ReadToEnd()
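The lines from Steps 5 to 7 assume that the request succeeds and never close the response. A minimal sketch of a more defensive version follows; PageFetcher and FetchSource are hypothetical names, the URL is assumed to be an http or https address, and returning Nothing on failure is just one possible choice:

```vb
Imports System
Imports System.IO
Imports System.Net

Module PageFetcher
    ' Hypothetical helper: downloads a page and returns its source, or Nothing on failure
    Function FetchSource(ByVal url As String) As String
        Try
            ' Assumes an http(s) URL; other schemes would make the cast fail
            Dim req As HttpWebRequest = DirectCast(WebRequest.Create(url), HttpWebRequest)
            ' Using blocks close the response and the reader even if reading fails
            Using res As HttpWebResponse = DirectCast(req.GetResponse(), HttpWebResponse)
                Using reader As New StreamReader(res.GetResponseStream())
                    Return reader.ReadToEnd()
                End Using
            End Using
        Catch ex As WebException
            ' Raised for DNS failures, timeouts and HTTP error statuses such as 404
            Return Nothing
        Catch ex As UriFormatException
            ' Raised by WebRequest.Create when the text is not a valid URI
            Return Nothing
        End Try
    End Function
End Module
```

In the click handler this could be used as Dim src As String = FetchSource(linkURL.Text), followed by a check for Nothing before the text is shown.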
Step 8:
Finally, we set the text of the Rich Text Box (srcBox) to our web page source (src) and set the DocumentText of the Web Browser (srcBrowser) to the same source. The browser is only there for testing, so that we can check whether the source we received resembles the website we want to scrape:
srcBox.Text = src
srcBrowser.DocumentText = srcBox.Text
Test:
As you can see from the image below, after testing our program on http://www.google.com we received the source code in our srcBox and our srcBrowser displays the page correctly. Great!
Extracting Data:
Step 1:
To make our data extraction easier, we are going to use a single function. This function uses regular expressions to extract the string that sits between two given strings inside a larger string. Add this to your source code:
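Private Function GetBetween(ByVal Source As String, ByVal Str1 As String, ByVal Str2 As String, Optional ByVal Index As Integer = 0) As String
    ' Escape both delimiters so characters such as "(" or "." are matched literally, not as regex syntax
    Return Regex.Split(Regex.Split(Source, Regex.Escape(Str1))(Index + 1), Regex.Escape(Str2))(0)
End Function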
To make Regex work, ensure you add this import at the top of your source code file:
Imports System.Text.RegularExpressions
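GetBetween throws an exception when the first string does not appear in the source, because the split then produces only one element. As a sketch of an alternative that avoids Regex altogether, the same idea can be written with IndexOf. TextHelper, GetBetweenOrEmpty and its parameter names are made up for this example, and returning an empty string when nothing is found is an assumption:

```vb
Imports System

Module TextHelper
    ' Hypothetical alternative to GetBetween: returns "" instead of failing when a delimiter is missing
    Function GetBetweenOrEmpty(ByVal source As String, ByVal startText As String, ByVal endText As String) As String
        ' Find the first delimiter
        Dim startPos As Integer = source.IndexOf(startText, StringComparison.Ordinal)
        If startPos < 0 Then Return String.Empty

        ' Move past the first delimiter, then look for the second one after it
        startPos += startText.Length
        Dim endPos As Integer = source.IndexOf(endText, startPos, StringComparison.Ordinal)
        If endPos < 0 Then Return String.Empty

        Return source.Substring(startPos, endPos - startPos)
    End Function
End Module
```

For example, GetBetweenOrEmpty("I'm Feeling Lucky", "I'm ", " Lucky") returns "Feeling", and a source that lacks either delimiter gives an empty string.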
Step 2:
Now we can easily extract data from our source code. The following code will simply output the word "Feeling" by extracting the substring of our source code that starts straight after "I'm " and ends just before " Lucky", which is the text shown on the "I'm Feeling Lucky" search button:
Dim extracted As String = GetBetween(src, "I'm ", " Lucky")
MsgBox(extracted)
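The optional Index parameter of GetBetween selects which occurrence of the first string to start from, with 0 being the first. The sketch below shows this on a made-up sample string rather than a real page; GetBetweenDemo and sample are names invented for the example, and the function is repeated here, with both delimiters passed through Regex.Escape, so that the block runs on its own as a console program:

```vb
Imports System
Imports System.Text.RegularExpressions

Module GetBetweenDemo
    ' Same idea as the tutorial's function; delimiters are escaped so they match literally
    Private Function GetBetween(ByVal Source As String, ByVal Str1 As String, ByVal Str2 As String, Optional ByVal Index As Integer = 0) As String
        Return Regex.Split(Regex.Split(Source, Regex.Escape(Str1))(Index + 1), Regex.Escape(Str2))(0)
    End Function

    Sub Main()
        ' Made-up sample markup, not taken from any real page
        Dim sample As String = "<li>apple</li><li>banana</li><li>cherry</li>"

        Console.WriteLine(GetBetween(sample, "<li>", "</li>"))    ' prints apple
        Console.WriteLine(GetBetween(sample, "<li>", "</li>", 2)) ' prints cherry
    End Sub
End Module
```

An Index that is larger than the number of occurrences makes the function throw an IndexOutOfRangeException, so it is worth wrapping the call in a Try block when the page layout is not known in advance.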
Project Completed!
That's it! Here is the finished source code for the scraper itself (the GetBetween function from the Extracting Data section is not included):
Imports System.Net
Imports System.IO
Public Class Form1
    Private Sub scrapeButton_Click(sender As Object, e As EventArgs) Handles scrapeButton.Click
        ' Only continue when the user has entered something
        If Not String.IsNullOrEmpty(linkURL.Text) Then
            ' Normalise the URL so that it has a scheme and a "www." prefix
            linkURL.Text = linkURL.Text.ToLower()
            If linkURL.Text.StartsWith("https://") OrElse linkURL.Text.StartsWith("http://") Then
                If Not linkURL.Text.StartsWith("https://www.") AndAlso Not linkURL.Text.StartsWith("http://www.") Then
                    If linkURL.Text.StartsWith("http://") Then
                        linkURL.Text = "http://www." & linkURL.Text.Substring(7)
                    Else
                        linkURL.Text = "https://www." & linkURL.Text.Substring(8)
                    End If
                End If
            ElseIf linkURL.Text.StartsWith("www.") Then
                linkURL.Text = "http://" & linkURL.Text
            Else
                linkURL.Text = "http://www." & linkURL.Text
            End If
            ' Request the page and read the response body as text
            Dim req As HttpWebRequest = DirectCast(WebRequest.Create(linkURL.Text), HttpWebRequest)
            Dim res As HttpWebResponse = DirectCast(req.GetResponse(), HttpWebResponse)
            Dim src As String = New StreamReader(res.GetResponseStream()).ReadToEnd()
            ' Show the raw source and a rendered preview of it
            srcBox.Text = src
            srcBrowser.DocumentText = srcBox.Text
        End If
    End Sub
End Class