SelectNodes / GetElementbyId

Topics: Developer Forum, User Forum
Nov 14, 2010 at 10:00 PM
Edited Nov 14, 2010 at 10:01 PM

Hi, I'm getting strange results when using either SelectNodes or GetElementbyId.

After loading Facebook's photos page: http://www.facebook.com/username?v=photos

into an HtmlDocument object using LoadHtml,

I'm trying to get the first table on the page like this: HtmlNode table = pics_page.DocumentNode.SelectNodes("//table")[0];

but SelectNodes doesn't seem to find the table I'm after. Why is that?

Also, I've tried using GetElementbyId a couple of times on numerous HTML sources and I'm not sure why it won't find the element I'm after. I found the following, which shed some light: http://stackoverflow.com/questions/2385840/how-to-get-all-input-elements-in-a-form-with-htmlagilitypack

But it didn't solve my problem. Any idea why GetElementbyId fails to find the element with the given id?
Thank you.

Nov 19, 2010 at 7:46 AM

Did you log in to the site? Set a debug breakpoint and look at the HTML source you actually received; this will show where the problem is.
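A minimal sketch of the kind of check this suggests, assuming the HtmlAgilityPack package is referenced. In HtmlAgilityPack, SelectNodes and SelectSingleNode return null when nothing matches, so indexing the result with [0] fails on a page without a table. The class and method names (FirstTableFinder, FindFirstTable) and the sample markup are made up for illustration.

```csharp
using System;
using HtmlAgilityPack;

public static class FirstTableFinder
{
    // Returns the first <table> node, or null when the document has none.
    public static HtmlNode FindFirstTable(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        // SelectSingleNode returns null (it does not throw) when nothing matches.
        return doc.DocumentNode.SelectSingleNode("//table");
    }

    public static void Main()
    {
        // Hypothetical input standing in for the fetched page.
        HtmlNode table = FindFirstTable("<html><body><p>no tables here</p></body></html>");
        Console.WriteLine(table == null ? "no table found" : table.OuterHtml);
    }
}
```

If the result is null, the table is probably not in the HTML that was fetched at all, which points to the download step rather than the XPath.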

Nov 19, 2010 at 10:27 AM

Here's the HTML of the image that I'm trying to get hold of:

Escaped:
<img class=\"logo img\" src=\"http:\/\/profile.ak.fbcdn.net\/hprofile-ak-snc4\/hs447.snc4\/IMAGE_FILE_NAME.jpg\" alt=\"USER_NAME\" id=\"profile_pic\" \/>

Unescaped:

<img class="logo img" src="http://profile.ak.fbcdn.net/hprofile-ak-snc4/hs447.snc4/IMAGE_FILE_NAME.jpg" alt="USER_NAME" id="profile_pic" />

As you can see, the original string is escaped. I've tried removing all the HTML escaping and still couldn't get my image element.
The HTML is retrieved using the following code:
// Requires: using System; using System.IO; using System.Net;
// webRequest and cookies are fields of the enclosing class.
public string GET(Uri address)
{
    webRequest = WebRequest.Create(address) as HttpWebRequest;
    webRequest.CookieContainer = cookies;
    webRequest.Method = "GET";
    webRequest.UserAgent = "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.517.41 Safari/534.7";

    string response = "";
    try
    {
        // The using blocks close the response and the reader even if reading throws.
        // (The original finally block called Close() only when webResponse was null.)
        using (WebResponse webResponse = webRequest.GetResponse())
        using (StreamReader sr = new StreamReader(webResponse.GetResponseStream()))
        {
            response = sr.ReadToEnd().Trim();
        }
    }
    catch (WebException ex)
    {
        // Report the failure instead of silently returning an empty string.
        Console.Error.WriteLine(ex.Message);
    }
    return response;
}

where the URI is Facebook's profile page: http://www.facebook.com/profile.php?id=user_id
The response then gets loaded into an HtmlDocument object using the LoadHtml method.
I'm quite sure the problem is with Facebook's HTML.

Thank you,

Nov 24, 2010 at 7:59 AM

I don't have a Facebook account (it is poison :-)), so I can't be 100% sure, and I can't test the query or look at the HTML, but somehow I don't believe that Facebook's HTML is malformed. I repeat my question: did your code log in to Facebook? If not, you won't get any pictures etc. (I presume the pictures are in a private area). When you visit Facebook with IE or FF you may think that everything works right from the start, but you are forgetting that your browser may hold a cookie with login information (a session id) that was created weeks ago. HtmlAgilityPack doesn't have this login information, so you need to log in, get the cookie, and only then will you be able to traverse profile information, look for friends, etc.
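A rough sketch of that log-in step, under loud assumptions: the login URL and the form field names ("email", "pass") are placeholders, not Facebook's actual form, and a real login page may need extra hidden fields. The idea is only that the login POST and the later GET share one CookieContainer.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Net;
using System.Text;

public static class LoginSketch
{
    // URL-encodes form fields into an application/x-www-form-urlencoded body.
    public static string BuildFormBody(IEnumerable<KeyValuePair<string, string>> fields)
    {
        var parts = new List<string>();
        foreach (var field in fields)
        {
            parts.Add(Uri.EscapeDataString(field.Key) + "=" + Uri.EscapeDataString(field.Value));
        }
        return string.Join("&", parts.ToArray());
    }

    // Posts the login form; cookies set by the server are stored in 'cookies'.
    // loginUrl is a hypothetical endpoint supplied by the caller.
    public static void Login(Uri loginUrl, string formBody, CookieContainer cookies)
    {
        var request = (HttpWebRequest)WebRequest.Create(loginUrl);
        request.CookieContainer = cookies;
        request.Method = "POST";
        request.ContentType = "application/x-www-form-urlencoded";

        byte[] payload = Encoding.UTF8.GetBytes(formBody);
        request.ContentLength = payload.Length;
        using (Stream body = request.GetRequestStream())
        {
            body.Write(payload, 0, payload.Length);
        }

        using (WebResponse response = request.GetResponse())
        {
            // The response body is not needed; the container now holds the session cookies.
        }
    }
}
```

The same CookieContainer instance would then be assigned to the request in the GET method above, so the profile page is requested as a logged-in user.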

Simple sample:

<html>
<head></head>
<body>
<img class=\"logo img\" src=\"http:\/\/profile.ak.fbcdn.net\/hprofile-ak-snc4\/hs447.snc4\/IMAGE_FILE_NAME.jpg\" alt=\"USER_NAME\" id=\"profile_pic1\" \/>
<img class="logo img" src="http://profile.ak.fbcdn.net/hprofile-ak-snc4/hs447.snc4/IMAGE_FILE_NAME.jpg" alt="USER_NAME" id="profile_pic2" />
</body>
</html>

HtmlDocument doc = new HtmlDocument();
doc.Load("Sample/Sample1.html");
if (doc.DocumentNode.SelectSingleNode("//*[contains(@id, 'profile_pic1')]") != null) Console.WriteLine("Match1");
if (doc.DocumentNode.SelectSingleNode("//*[@id='profile_pic2']") != null) Console.WriteLine("Match2");
if (doc.GetElementbyId("profile_pic2") != null) Console.WriteLine("Match3");

The HTML shouldn't contain escaped \" symbols.
HtmlAgilityPack treats them as part of the attribute value (from the debugger):
+        Value    Name: "id", Value: "\\\"profile_pic1\\\""    HtmlAgilityPack.HtmlAttribute

But I repeat myself: I don't believe that Facebook uses malformed HTML.
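If the fetched markup really does arrive with \" and \/ escapes, one option is to undo them before calling LoadHtml. This is a minimal sketch that handles only the two escapes shown in this thread; the class name EscapedHtml and the example.com URL are made up, and if the markup is embedded in a JSON or JavaScript string, a proper JSON parser would be the more robust choice.

```csharp
using System;

public static class EscapedHtml
{
    // Undoes the two escapes seen in the thread: \" becomes " and \/ becomes /.
    public static string Unescape(string escaped)
    {
        return escaped.Replace("\\\"", "\"").Replace("\\/", "/");
    }

    public static void Main()
    {
        // Hypothetical escaped fragment in the same shape as the one posted above.
        string escaped = "<img src=\\\"http:\\/\\/example.com\\/a.jpg\\\" id=\\\"profile_pic\\\" \\/>";
        Console.WriteLine(Unescape(escaped));
        // prints: <img src="http://example.com/a.jpg" id="profile_pic" />
    }
}
```

After this step the id attribute value is plain profile_pic, so an exact match such as //*[@id='profile_pic'] or GetElementbyId("profile_pic") has something to match against.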