HTTP Request and HTML Parsing in .NET
I've been working on a college assignment whose goal is to integrate several information systems. One of those is a website where we can see the flight schedules from/to Portuguese airports.
The website URL is http://www.innovata-llc.com/ana/default.asp. A colleague and I had to write a web service, used by BizTalk, that wraps this website's form. It had to send an HTTP POST request carrying the form parameters, then parse the HTML response to extract the flight list for a given destination city and departure date.
First of all, let's see the code we used to get the response.
// Requires: using System; using System.IO; using System.Net; using System.Text;
Uri address = new Uri(requestURL);
HttpWebRequest request = WebRequest.Create(address) as HttpWebRequest;
request.Method = "POST";
request.ContentType = "application/x-www-form-urlencoded";
StringBuilder data = new StringBuilder();
data.Append("DPT_Date=" + "17-05-2008");
data.Append("&RET_Date=" + "20-05-2008");
data.Append("&dpt_station=" + "LIS");
data.Append("&arv_station=" + "LHR");
data.Append("&non_stops=" + "on");
// Create a byte array of the data we want to send
byte[] byteData = Encoding.UTF8.GetBytes(data.ToString());
// Set the content length in the request headers
request.ContentLength = byteData.Length;
// Write data
using (Stream postStream = request.GetRequestStream())
{
    postStream.Write(byteData, 0, byteData.Length);
}
// Get response
String htmlResponse;
using (HttpWebResponse response = request.GetResponse() as HttpWebResponse)
// Wrap the response stream in a reader that is disposed along with the response
using (StreamReader reader = new StreamReader(response.GetResponseStream()))
{
    // Get the response string
    htmlResponse = reader.ReadToEnd();
}
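The values above are hard-coded and happen to contain no characters that need escaping. When they come from user input, each name and value should be URL-encoded before being joined. Here is a minimal sketch of that step, assuming a helper class of our own named FormBody (it is not part of the original service). Note that Uri.EscapeDataString encodes a space as %20 rather than +, which servers generally accept.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper, not part of the original web service.
public static class FormBody
{
    // Joins name=value pairs with '&', escaping every name and value.
    public static string Build(IEnumerable<KeyValuePair<string, string>> fields)
    {
        StringBuilder body = new StringBuilder();
        foreach (KeyValuePair<string, string> field in fields)
        {
            if (body.Length > 0)
            {
                body.Append('&');
            }
            body.Append(Uri.EscapeDataString(field.Key));
            body.Append('=');
            body.Append(Uri.EscapeDataString(field.Value));
        }
        return body.ToString();
    }

    public static void Main()
    {
        List<KeyValuePair<string, string>> fields = new List<KeyValuePair<string, string>>
        {
            new KeyValuePair<string, string>("DPT_Date", "17-05-2008"),
            new KeyValuePair<string, string>("RET_Date", "20-05-2008"),
            new KeyValuePair<string, string>("dpt_station", "LIS"),
            new KeyValuePair<string, string>("arv_station", "LHR"),
            new KeyValuePair<string, string>("non_stops", "on")
        };
        // Prints: DPT_Date=17-05-2008&RET_Date=20-05-2008&dpt_station=LIS&arv_station=LHR&non_stops=on
        Console.WriteLine(Build(fields));
    }
}
```

The resulting string can be passed to Encoding.UTF8.GetBytes in place of data.ToString().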
After getting the response content, it is time to use a nice HTML parsing library. It is called HTML Agility Pack and it is open source. You can find it at Codeplex.com/htmlagilitypack.
Now it is time to create an HtmlDocument (a class that ships with HTML Agility Pack) and load the response into this new instance.
// Load the HTML response into a document tree
HtmlDocument htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(htmlResponse);
After looking at the HTML source code, we worked out the XPath expression needed to extract the flight list.
// Get flight lines
HtmlNodeCollection flights = htmlDoc.DocumentNode.SelectNodes(
    "//body/div/table[3]/tr[position()>=4 and position()<last()-1]");
// SelectNodes returns null when nothing matches, so guard before iterating
if (flights != null)
{
    foreach (HtmlNode flight in flights)
    {
        // Get attributes
        string departureTime = flight.ChildNodes[1].FirstChild.InnerText;
        string arrivalTime = flight.ChildNodes[3].FirstChild.InnerText;
        // Do some stuff …
    }
}
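The row filter in that XPath expression keeps every tr from the fourth one up to, but not including, the last two. To show this in isolation, here is a small sketch that runs the same predicate against a made-up, well-formed table. It uses System.Xml's XmlDocument instead of HTML Agility Pack only so that it runs without an extra package. The sample markup, the class name RowFilterDemo and the cell positions are all assumptions, not the real page.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

// Hypothetical demo class; the markup below is invented for illustration.
public static class RowFilterDemo
{
    // Three header rows, two flight rows, two footer rows.
    public const string SampleTable =
        "<table>" +
        "<tr><td>header 1</td></tr>" +
        "<tr><td>header 2</td></tr>" +
        "<tr><td>header 3</td></tr>" +
        "<tr><td>07:05</td><td>09:40</td></tr>" +
        "<tr><td>13:20</td><td>15:55</td></tr>" +
        "<tr><td>footer 1</td></tr>" +
        "<tr><td>footer 2</td></tr>" +
        "</table>";

    public static List<string> ExtractTimes(string xml)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(xml);

        // With 7 rows, last()-1 is 6, so only positions 4 and 5 pass the filter.
        XmlNodeList rows = doc.SelectNodes(
            "/table/tr[position()>=4 and position()<last()-1]");

        List<string> result = new List<string>();
        foreach (XmlNode row in rows)
        {
            // Pick cells by element position rather than by child-node index.
            string departure = row.SelectSingleNode("td[1]").InnerText;
            string arrival = row.SelectSingleNode("td[2]").InnerText;
            result.Add(departure + " -> " + arrival);
        }
        return result;
    }

    public static void Main()
    {
        foreach (string line in ExtractTimes(SampleTable))
        {
            Console.WriteLine(line);
        }
        // Prints:
        // 07:05 -> 09:40
        // 13:20 -> 15:55
    }
}
```

Selecting cells with td[1] and td[2] avoids depending on child-node indexes. ChildNodes counts text nodes as well as elements, which is likely why the code above reads indexes 1 and 3. HtmlNode in HTML Agility Pack also offers SelectSingleNode, so the same idea should carry over.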
That's almost it. The next step was adding the extracted information to some data structures and returning that data, but those details are outside the scope of this post.






