Removing certain types of images based on extension, width, height

Apr 23, 2010 at 8:58 AM

Hi, I just found this project and I think it is great; it is exactly what I was looking for. However, I need your help.

I am getting all the image references on a web page, but I don't know how to use HAP to filter certain images. For example, I don't want GIFs, and if an image declares a width or height, I don't want it unless that value is bigger than 150px.

Also, what is the best way to check whether an img has a full (absolute) reference or a relative one?

links = doc.DocumentNode.SelectNodes("//img");

        if (links != null)
        {
            foreach (HtmlNode link in links)
            {
                HtmlAttribute att = link.Attributes["src"];
                // Skip img tags that have no src attribute
                if (att != null)
                {
                    Response.Write(att.Value);
                }
            }
        }

 

Thanks for your help

Apr 23, 2010 at 1:25 PM

A lot of that you'll have to do yourself.

To find the height and width, you can check the other attributes on the img tag, but for the real dimensions you'll have to download the image yourself and load it into a Bitmap in your code.
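
As a rough sketch of the attribute check, assuming the width and height values have already been read from the img tag as strings: the class and method names below (DeclaredSizeFilter, ParsePixels, PassesDeclaredSize) are made up for illustration and are not part of HAP. An image with no usable declared size is kept, since its real dimensions are unknown at that point.

```csharp
using System;
using System.Globalization;

public static class DeclaredSizeFilter
{
    // Parses values such as "200" or "200px".
    // Returns null for missing, percentage, or malformed values.
    public static int? ParsePixels(string value)
    {
        if (string.IsNullOrWhiteSpace(value)) return null;

        string trimmed = value.Trim();
        if (trimmed.EndsWith("px", StringComparison.OrdinalIgnoreCase))
        {
            trimmed = trimmed.Substring(0, trimmed.Length - 2).Trim();
        }

        int pixels;
        bool parsed = int.TryParse(trimmed, NumberStyles.None,
                                   CultureInfo.InvariantCulture, out pixels);
        return parsed ? pixels : (int?)null;
    }

    // Rejects an image only when a declared dimension is present
    // and is not bigger than minPixels.
    public static bool PassesDeclaredSize(string widthAttr, string heightAttr, int minPixels)
    {
        int? width = ParsePixels(widthAttr);
        int? height = ParsePixels(heightAttr);

        bool widthOk = !width.HasValue || width.Value > minPixels;
        bool heightOk = !height.HasValue || height.Value > minPixels;
        return widthOk && heightOk;
    }
}
```

With HAP, the two strings could come from something like node.GetAttributeValue("width", "") and node.GetAttributeValue("height", ""), with 150 passed as minPixels.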

To tell whether it's a relative or full reference, just check whether the src value starts with http. If it doesn't, prepend the URL of the directory the web page is in. Luckily, many web servers still honor the ../../somedirectory/somefile.jpg syntax even when it follows an http prefix, e.g. http://somedomain.com/path/to/page/dir/../../../images/image.png. I've used this trick a few times when scraping sites; still, your mileage may vary.
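
An alternative to gluing the strings together by hand is to let System.Uri do the resolution, which also collapses the ../ segments. This is only a sketch: the ImageUrlHelper class and its method names are hypothetical, and pageUrl is assumed to be the address the page was fetched from.

```csharp
using System;

public static class ImageUrlHelper
{
    // True when src is already a full http or https reference.
    public static bool IsAbsolute(string src)
    {
        Uri parsed;
        return Uri.TryCreate(src, UriKind.Absolute, out parsed)
            && (parsed.Scheme == Uri.UriSchemeHttp || parsed.Scheme == Uri.UriSchemeHttps);
    }

    // Resolves src against the page URL; returns null when either value cannot be parsed.
    public static string ToAbsolute(string pageUrl, string src)
    {
        Uri baseUri;
        if (!Uri.TryCreate(pageUrl, UriKind.Absolute, out baseUri)) return null;

        Uri resolved;
        return Uri.TryCreate(baseUri, src, out resolved) ? resolved.AbsoluteUri : null;
    }
}
```

For example, ToAbsolute("http://somedomain.com/path/to/page/dir/", "../../../images/image.png") should give http://somedomain.com/path/images/image.png, and a src that is already absolute comes back unchanged.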

One thing you could do for the image filtering by size is use a lambda expression instead.

 

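var imageList = doc.DocumentNode.Descendants("img")
    .Select(x => MakeImgSrcAbsolute(x))
    .Where(x => IsImageCorrectWidth(x.GetAttributeValue("src", ""), 150))
    .ToList();

// Returns the node so it can be chained inside Select
public HtmlNode MakeImgSrcAbsolute(HtmlNode node)
{
    //fix src attribute here
    return node;
}

// Bitmap(string) reads a local file, so source must be the path
// of an image that has already been downloaded, not a URL
public bool IsImageCorrectWidth(string source, int minWidth)
{
    // Dispose the bitmap so the image is released after measuring
    using (var bitmap = new Bitmap(source))
    {
        return bitmap.Width > minWidth;
    }
}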


 

Apr 23, 2010 at 4:54 PM

Thanks for your helpful reply. I am relatively new to C# and had never heard of lambda expressions before, but I read about them and I think I get it.

Two quick questions for you. We basically want to do something similar to Facebook on our intranet: when a sales rep adds a link, we want to fetch the HTML and display a list of images so they can select a main image. We thought about downloading the images to a temp folder to check their size, but we are concerned about performance, since we want the images to be displayed as fast as possible. Even though we will resize the selected image, we would like to remove icons and small images from the list.

Facebook is very good at filtering out icons. Do you think they do what you suggested (Bitmap)? Have you tried this before? Is the bitmap loaded into memory and then released? And there is no way to get that information from the HTML, right?

return new Bitmap(source).Width<=maxWidth;

Question #2: is there an easy way to select only the nodes where the extension is not gif, using only one statement?

links = doc.DocumentNode.SelectNodes("//img");
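
One possible shape for this, sketched on plain src strings rather than on HAP nodes: a single LINQ Where over an extension check. The ExtensionFilter class and its methods are hypothetical helpers, not something HAP provides, and the check looks only at the URL text, so an image served as a GIF under a different extension would slip through.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ExtensionFilter
{
    // Compares the extension case-insensitively, ignoring any query string or fragment.
    public static bool HasExtension(string src, string extension)
    {
        if (string.IsNullOrEmpty(src)) return false;

        int cut = src.IndexOfAny(new[] { '?', '#' });
        string path = cut >= 0 ? src.Substring(0, cut) : src;
        return path.EndsWith(extension, StringComparison.OrdinalIgnoreCase);
    }

    // The "one statement" version: keep every source that is not a .gif.
    public static List<string> WithoutGifs(IEnumerable<string> sources)
    {
        return sources.Where(s => !HasExtension(s, ".gif")).ToList();
    }
}
```

The same predicate could be applied to nodes instead of strings, for example by passing each node's src value obtained with GetAttributeValue("src", "") to HasExtension.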

 

Thanks a lot for your help!