HtmlAgilityPack on strings in C#

Jan 30, 2013 at 6:57 AM

Will this project work if I have a string with regular text and HTML mixed together? I.e., I have plain text and paragraphs mixed in one string.

Jan 30, 2013 at 7:11 AM

Yes, it will still work, but you will be unable to get the text before and after the first tags.

e.g.

abcd
<div>hij</div>
dsfg

hp.DocumentNode.SelectSingleNode("//div").InnerText == "hij"

Jan 30, 2013 at 7:17 AM

Thank you for the answer. I suppose I will keep trying to get my regex to match. I do see the advantage of an HTML parser, but I don't think it'll work in this instance, where the plain text is also important.

Jan 30, 2013 at 7:21 AM

If you need help with regex, give us a shout; I have done some complicated regex in the past.

Lee


Jan 30, 2013 at 7:30 AM
Edited Jan 30, 2013 at 7:36 AM

I actually have a pretty big issue.

I am printing to a PDF in C#, and my boss wanted me to implement TinyMCE, which went fine. Unfortunately, we are using an old PDF printer class that only supports b, i and u tags in HTML, but I need to be able to create indents as well.

This is how indentation looks in TinyMCE:

<p style="padding-left: 30px;">ijkk</p>

Unfortunately, I only have a string that contains mixed plain text and HTML (from TinyMCE), so I wanted to write a regex that gets all p tags with attributes (I have done this) and then, based on the number of pixels in "padding-left:", replaces each tag with the matching amount of whitespace (e.g. 30px worth) followed by the text (if that makes sense?).

Here's what I have come up with so far:

text = Regex.Replace(text, @"<p.*?>(.*)</p>", "    " + "$1");

But the whitespace is hard-coded.

Jan 30, 2013 at 7:41 PM
Edited Jan 30, 2013 at 7:45 PM

Here is my solution. It loops through the regex matches to do the replace, as I couldn't think of a way to insert a varying number of spaces based on the match.

string html = "erwerw<p style=\"padding-left: 30px;\">ijkk</p>rwwr<p style=\"padding-left: 40px;\">ijkk</p>";
// Named groups: complete = the whole <p> element, number = pixel count, text = inner text.
MatchCollection mc = Regex.Matches(html, "(?<complete><p[^>]*style=[^>]*\"[^>]*padding-left:\\s?(?<number>[0-9]+)px[^>]*>(?<text>[^<]*)</p>)");
foreach (Match match in mc)
{
   if (match.Success)
   {
      // One space per pixel of padding, followed by the paragraph text.
      html = html.Replace(match.Groups["complete"].Value, string.Format("{0}{1}", new string(' ', Convert.ToInt32(match.Groups["number"].Value)), match.Groups["text"].Value));
   }
}
Hope this helps
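A possible single-pass variation on the loop above, as a minimal sketch only: `Regex.Replace` has an overload that takes a `MatchEvaluator` callback, so the number of spaces can be computed per match. The class and method names are made up for illustration, and the one-space-per-pixel ratio is simply carried over from the loop.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper; the names are illustrative and not part of any library.
public static class IndentConverter
{
    // Matches <p ... style="... padding-left: NNpx ...">text</p>,
    // where text contains no nested tags.
    private static readonly Regex IndentedParagraph = new Regex(
        @"<p\b[^>]*style=""[^""]*padding-left:\s*(?<number>[0-9]+)px[^>]*>(?<text>[^<]*)</p>",
        RegexOptions.IgnoreCase);

    // Replaces each indented paragraph with its text, prefixed by one space
    // per pixel of padding. Input without such paragraphs is returned unchanged.
    public static string ReplaceIndents(string html)
    {
        return IndentedParagraph.Replace(html, match =>
        {
            int pixels = int.Parse(match.Groups["number"].Value);
            return new string(' ', pixels) + match.Groups["text"].Value;
        });
    }
}
```

Calling `IndentConverter.ReplaceIndents(html)` on the sample string would handle both paragraphs in one call. Paragraphs that contain nested tags are not matched, the same limitation as the original pattern.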
Jan 31, 2013 at 6:17 AM
Wow this is perfect :-D
Thank you very much! I didn't know if it was even possible. I had made a different solution, but it is not even close to being as dynamic as this regex!
May 13, 2014 at 11:31 PM
I couldn't help but notice you saying "If you need help with regex, give us a shout; I have done some complicated regex in the past". I have been stuck on this issue for a while now, and I am looking for any advice or support I can get. If you would be so kind as to check out my question, I would be more than appreciative. https://htmlagilitypack.codeplex.com/discussions/545246
I just want to remove HTML tags from my string, but I also want to keep the bold and italic formatting. Thank you in advance.
May 13, 2014 at 11:42 PM
Hi Adandrea,

Just taking a quick look, it looks as if you just want to remove the HTML tags, and the best way to do that is something similar to what you did with Regex. HtmlAgilityPack is mainly for pulling apart a page and extracting specific information: for example, if a table with id="lookingforme" always appears on a page, you can pull the details out of that table and use them. I can give you a sample for this. But if you give me a sample of the HTML you want to strip the tags from, while leaving the italic and bold tags, I can see what I can do.

Regards,

Lee


May 13, 2014 at 11:53 PM
Thank you so much, Lee.

One problem is that this description field pulls in the description of different incidents, which can change depending on which one the user selects to view. However, here is one example.

This is what I should be seeing (with the word API in italics and the name Incident API in bold):
For testing of API calls in Incident API.

However, this is what I see:
<p style="margin: 0px 0px 12px 0px;text-align: left;text-indent: 0pt;padding: 0px 0px 0px 0px;"><span style="font-family: 'Verdana';

I'm not sure if some of the tags may be running off the page, but this is all I can see from this point.

Thank you so much. I am semi-new to coding and I really appreciate the help. The regex I am using now will display the text but will not show any bold or italics.

Can't thank you enough!
May 14, 2014 at 12:00 AM
I can't see any italics in that sample.

I was looking for either <i>test italics</i> or <b>test bold</b> to be in the HTML, or maybe

test styling

If you could include a sample with that in it, that would be great.



May 14, 2014 at 12:09 AM
Here you go, this is what I am getting:
<p style="margin: 0px 0px 12px 0px;text-align: left;text-indent: 0pt;padding: 0px 0px 0px 0px;"><span style="font-family: 'Verdana';font-style: Normal;font-weight: normal;font-size: 16px;color: #000000;">For testing of </span><span style="font-family: Verdana; font-weight: normal; font-size: 16px; color: rgb(0, 0, 0);"><i>API </i></span><span style="font-family: 'Verdana';font-style: Normal;font-weight: normal;font-size: 16px;color: #000000;">calls in </span><span style="font-family: 'Verdana';font-style: Normal;font-weight: bold;font-size: 16px;color: #000000;">Incident API</span><span style="font-family: 'Verdana';font-style: Normal;font-weight: normal;font-size: 16px;color: #000000;">.</span></p>
Sorry about that, there was a ton of code running off the page that I couldn't see. Hope that will be more helpful.
May 14, 2014 at 1:04 AM
Edited May 14, 2014 at 1:06 AM
How about this? It will require two passes.

result:
<b>For testing of </b><i>API </i>calls in <b>Incident API</b>.
string v = "<p style=\"margin: 0px 0px 12px 0px;text-align: left;text-indent: 0pt;padding: 0px 0px 0px 0px;\"><span style=\"font-family: 'Verdana';font-style: Normal;font-weight: bold;font-size: 16px;color: #000000;\">For testing of </span><span style=\"font-family: Verdana; font-weight: normal; font-size: 16px; color: rgb(0, 0, 0);\"><i>API </i></span><span style=\"font-family: 'Verdana';font-style: Normal;font-weight: normal;font-size: 16px;color: #000000;\">calls in </span><span style=\"font-family: 'Verdana';font-style: Normal;font-weight: bold;font-size: 16px;color: #000000;\">Incident API</span><span style=\"font-family: 'Verdana';font-style: Normal;font-weight: normal;font-size: 16px;color: #000000;\">.</span></p>";

Regex r = new Regex(@"<span [^>]*(?:font-weight:\s?bold)[^>]*>(?<boldtext>[^<]*)</span>");
string t = r.Replace(v,"<b>$1</b>");


Regex p = new Regex("<((?!b)(?!i))[^/>]*>|</((?!b)(?!i))[^>]*>");
string q = p.Replace(t, "");
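A possible tightening of the second pass, as a sketch only. The lookaheads `(?!b)(?!i)` in the pattern above reject any tag whose name merely starts with b or i, so tags such as `<br>` or `<img>` would survive the strip, and `[^/>]*` stops at a `/` inside an attribute value. The variant below compares the whole tag name instead. The class and method names are hypothetical, and bold spans that contain nested tags are still not converted.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper; the names are illustrative and not part of any library.
public static class TagStripper
{
    // Pass 1: turn <span style="...font-weight: bold...">text</span> into <b>text</b>.
    private static readonly Regex BoldSpan = new Regex(
        @"<span\b[^>]*font-weight:\s*bold[^>]*>(?<boldtext>[^<]*)</span>",
        RegexOptions.IgnoreCase);

    // Pass 2: remove every opening or closing tag whose name is not exactly b or i.
    // The \b after the name means <br> or <img> are not mistaken for <b> or <i>.
    private static readonly Regex OtherTags = new Regex(
        @"</?(?!(?:b|i)\b)[a-z][^>]*>",
        RegexOptions.IgnoreCase);

    public static string KeepBoldAndItalic(string html)
    {
        string withBold = BoldSpan.Replace(html, "<b>${boldtext}</b>");
        return OtherTags.Replace(withBold, string.Empty);
    }
}
```

Using `${boldtext}` in the replacement refers to the named group directly rather than relying on its number.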
May 14, 2014 at 1:05 AM
P.S. I changed the HTML you sent and added an extra bold in there for testing.


May 14, 2014 at 1:12 AM
Thanks so much, I'm going to try this now. Should I just be able to add this into my code, or do I need to reference the description field at all?
May 14, 2014 at 1:22 AM
Replace
report.Description = Regex.Replace(report.Description, "<.*?>| ", string.Empty);

with the following:
Regex r = new Regex(@"<span [^>]*(?:font-weight:\s?bold)[^>]*>(?<boldtext>[^<]*)</span>");
report.Description = r.Replace(report.Description, "<b>$1</b>");

Regex p = new Regex("<((?!b)(?!i))[^/>]*>|</((?!b)(?!i))[^>]*>");
report.Description = p.Replace(report.Description, "");
May 14, 2014 at 1:29 AM
Thanks for the clarification, but unfortunately it doesn't look like it's taking the formatting. This is what I got this time: For testing of <i>API </i>calls in <b>Incident API</b>.
May 14, 2014 at 1:32 AM
That is correct though, isn't it? You wanted the bold and italic formatting to stay, so that only the text and the italic and bold tags remain?


May 14, 2014 at 5:38 PM
Yes, that is correct, but I was hoping that I would actually see the text in bold and italics instead of seeing the tags. Does that make sense, and is that possible?
May 14, 2014 at 10:45 PM
OK, you have to actually create a font for each part of the text and draw that. It's a bit annoying for you, but here is a sample of how I'd do it.

You could also try this company's HTML-to-PDF software. I use it at work and it is very handy.
http://www.winnovative-software.com/
static void Main(string[] args)
        {
            
            string v = "<p style=\"margin: 0px 0px 12px 0px;text-align: left;text-indent: 0pt;padding: 0px 0px 0px 0px;\"><span style=\"font-family: 'Verdana';font-style: Normal;font-weight: bold;font-size: 16px;color: #000000;\">For testing of </span><span style=\"font-family: Verdana; font-weight: normal; font-size: 16px; color: rgb(0, 0, 0);\"><i>API </i></span><span style=\"font-family: 'Verdana';font-style: Normal;font-weight: normal;font-size: 16px;color: #000000;\">calls in </span><span style=\"font-family: 'Verdana';font-style: Normal;font-weight: bold;font-size: 16px;color: #000000;\">Incident API</span><span style=\"font-family: 'Verdana';font-style: Normal;font-weight: normal;font-size: 16px;color: #000000;\">.</span></p>";
            HtmlDocument doc = new HtmlDocument();
            doc.LoadHtml(v);

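            // SelectNodes returns null when no span matches, so guard against that in real code.
            foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//span"))
            {
                bool hasBold = false;
                bool hasItalic = false;
                string text = link.InnerText;
                // Bold is carried in the style attribute (empty string if the span has no style).
                if (link.GetAttributeValue("style", "").Contains("bold")){
                    hasBold = true;
                }
                // Italics arrive as a nested <i> tag.
                if (link.InnerHtml.Contains("<i>")){
                    hasItalic = true;
                }

                XFont font2 = GetFont(hasBold, hasItalic);

                // graphics, margin, page and lineHeight are assumed to come from the existing PDF code.
                graphics.DrawString("" + text, font2, XBrushes.Black, new XRect(margin + 90, page.Height - (lineHeight * 35), page.Width, page.Height), XStringFormats.TopLeft);

            }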
        }


        private static XFont GetFont(bool hasBold, bool hasItalic)
        {
            XFont font;
            if (hasBold && hasItalic)
                font = new XFont("Verdana", 20, XFontStyle.BoldItalic);
            else if (hasBold)
                font = new XFont("Verdana", 20, XFontStyle.Bold);
            else if (hasItalic)
                font = new XFont("Verdana", 20, XFontStyle.Italic);
            else 
                font = new XFont("Verdana", 20, XFontStyle.Normal);
            return font;
        }
May 14, 2014 at 10:55 PM
Thank you so much for all your help and support, you have no idea how much I appreciate it. For the example above, because I am pulling in different incidents with different description fields, will this mean I need to change the static void Main area for each instance, or is there a way to set it so that no matter which incident gets chosen, the correct formatting will appear? Not sure if that makes sense, but I'm just curious how that will work.
May 14, 2014 at 11:14 PM
It's no problem. What you should do is create a function that loops through all incidents and passes each report description in place of v:
foreach (var incident in incidents){
 
 DrawDescription(incident.Description);

}
Then create a method outside of Main that takes the description as a parameter and loads it into the document:

doc.LoadHtml(description);

It should work, assuming all descriptions are similar, i.e. they have spans containing all the text, with italics as tags and bold in the style. If there are some differences, you will have to code them in by adding new if statements on link, for example:

if (link.Attributes["style"].Value.Contains("italic"))
        public static void CreateDocument()
        {
            foreach (var incident in incidents)
            {
                DrawDescription(incident.Description);
            }
        }

        public static void DrawDescription(string description)
        {
            
            HtmlDocument doc = new HtmlDocument();
            doc.LoadHtml(description);

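            // SelectNodes returns null when no span matches, so guard against that in real code.
            foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//span"))
            {
                bool hasBold = false;
                bool hasItalic = false;
                string text = link.InnerText;
                // Bold is carried in the style attribute (empty string if the span has no style).
                if (link.GetAttributeValue("style", "").Contains("bold"))
                {
                    hasBold = true;
                }
                // Italics arrive as a nested <i> tag.
                if (link.InnerHtml.Contains("<i>"))
                {
                    hasItalic = true;
                }

                XFont font2 = GetFont(hasBold, hasItalic);

                // graphics, margin, page and lineHeight are assumed to come from the existing PDF code.
                graphics.DrawString("" + text, font2, XBrushes.Black, new XRect(margin + 90, page.Height - (lineHeight * 35), page.Width, page.Height), XStringFormats.TopLeft);

            }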
        }

        private static XFont GetFont(bool hasBold, bool hasItalic)
        {
            XFont font;
            if (hasBold && hasItalic)
                font = new XFont("Verdana", 20, XFontStyle.BoldItalic);
            else if (hasBold)
                font = new XFont("Verdana", 20, XFontStyle.Bold);
            else if (hasItalic)
                font = new XFont("Verdana", 20, XFontStyle.Italic);
            else
                font = new XFont("Verdana", 20, XFontStyle.Normal);
            return font;
        }
May 14, 2014 at 11:25 PM
Wow, I can't thank you enough. I will try this later today and let you know how it goes. Thank you again!
May 14, 2014 at 11:26 PM
Do. Hope it works.

Best of luck.

Lee


May 21, 2014 at 7:38 PM
Hi Lee,

Sorry it took so long to get back to you. I tested out the code, and the good news is that it looks like it is displaying the bold text. However, the way it is displayed on the page is somewhat overlapping: all the text of that description field overlaps, so you can't fully read what it says. I do see the bold though, which is a great sign, and I don't see any tags! I'm going to try to look into this issue a little more, and if you have any advice, that would be great too. Can't thank you enough.
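One likely cause of the overlap, judging from the sample code earlier in the thread: every `DrawString` call is given the same `XRect`, so each span starts at the same point on the page. A minimal sketch of the usual remedy is to advance the x position by the width of each run before drawing the next one. The helper below is hypothetical, and `measureWidth` is a stand-in for the PDF library's text measurement; with PDFsharp (which the `XFont` and `XRect` types suggest is in use) that would presumably be `graphics.MeasureString(text, font2).Width`, but that is an assumption about the setup.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper; the names are illustrative and not part of any library.
public static class RunLayout
{
    // Returns the x position at which each run should start so that the runs
    // follow one another on a single line instead of sharing one start point.
    public static List<double> StartPositions(
        IEnumerable<string> runs, double startX, Func<string, double> measureWidth)
    {
        var positions = new List<double>();
        double x = startX;
        foreach (string run in runs)
        {
            positions.Add(x);          // this run starts where the previous one ended
            x += measureWidth(run);    // advance by the measured width of this run
        }
        return positions;
    }
}
```

Each position would then replace the fixed `margin + 90` in the rectangle passed to `DrawString`. Line wrapping is not handled in this sketch.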
May 28, 2014 at 7:02 PM
Edited Jun 19, 2014 at 10:25 PM
*