Embolden keywords without messing up HTML tags

Topics: Developer Forum, User Forum
Mar 24, 2011 at 7:14 PM

Our web site emboldens certain words in user-generated content by using VB.NET's Replace function. But this breaks links and image tags if a word we want to embolden also appears in the src attribute of an <img> tag or the href attribute of an <a> tag.

Does anyone know how to use the Html Agility Pack to replace all occurrences of a word except where the word is in an attribute of an HTML tag?

For example, say a user has entered the following HTML:

Dogs are my favourite animals.
<a href="http://www.dogs.com">Click here to see my dog</a>. 
Here is a photo of some dogs: <img src="/images/dogs.jpg" />.

If I use VB.NET's Replace function... strHtml = Replace(strHtml, "dogs", "<strong>dogs</strong>") ... that works fine for the text outside the <a> and <img> tags but wrecks the href and src attributes, making them look like this:

href="http://www.<strong>dogs</strong>.com"
src="/images/<strong>dogs</strong>.jpg"
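
A minimal console sketch that reproduces the breakage described above, assuming a plain VB.NET console project. The module name NaiveReplaceDemo is only for illustration; the sample string is the last line of the example HTML.

```vbnet
Imports System
Imports Microsoft.VisualBasic

Module NaiveReplaceDemo
    Sub Main()
        ' Sample input taken from the example HTML above.
        Dim strHtml As String = "Here is a photo of some dogs: <img src=""/images/dogs.jpg"" />."

        ' Replace works on the raw string, so it cannot tell visible text from attribute values.
        strHtml = Replace(strHtml, "dogs", "<strong>dogs</strong>")

        ' Both occurrences are wrapped, including the one inside the src attribute.
        Console.WriteLine(strHtml)
    End Sub
End Module
```

The word in the sentence and the word in the src value are treated identically, which is why the replacement has to be made aware of the HTML structure.
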
Any help much appreciated.

Sep 29, 2011 at 10:11 PM

Did you find a solution for this? I am interested too.

Thanks.

Oct 3, 2011 at 4:58 PM

This is how I made it work for my project. Call the function EmphasiseKeywordsInHtml, passing it the text as a string and the words you want emphasised as an array of strings. The emphasisTag parameter can be "b", "strong" or "i", depending on which tag you want to wrap round the keywords. Set outputAsXHtml to True if you want empty elements such as <img> written out in closed XHTML form. Let me know if you find this useful.

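        'Requires at the top of the file: Imports System.IO, System.Text, System.Text.RegularExpressions and HtmlAgilityPack.
        Public Shared Function EmphasiseKeywordsInHtml(ByVal text As String, ByVal keywordsToEmphasise() As String, ByVal emphasisTag As String, ByVal outputAsXHtml As Boolean) As String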

            Dim strReturn As String = ""

            'Check for empty/zero length string.
            If text <> "" Then

                'Load the content into an Html Agility Pack document.
                Dim document As HtmlDocument = New HtmlDocument
                document.LoadHtml(text)


                'Are we outputting as XHTML or HTML?
                If outputAsXHtml = True Then
                    'Output as XHTML.
                    document.OptionWriteEmptyNodes = True
                Else
                    'Output as HTML.
                    document.OptionWriteEmptyNodes = False
                End If


                'Recursively iterate through the nodes of the html emphasising keywords.
                _EmphasiseKeywordsInHtmlRecursive(document.DocumentNode, keywordsToEmphasise, emphasisTag)


                'Output the Html document to a string.
                Dim sb As StringBuilder = New StringBuilder()
                Using sw As New StringWriter(sb)
                    document.Save(sw)
                    sw.Flush()
                    strReturn = sw.ToString()
                End Using
                sb = Nothing


                'Finished with the Html Agility Pack document.
                document = Nothing

            End If

            'Return the HTML with the keywords emphasised.
            Return strReturn

        End Function


        Private Shared Sub _EmphasiseKeywordsInHtmlRecursive(ByVal node As HtmlNode, ByVal keywordsToEmphasise() As String, ByVal emphasisTag As String)

            Dim intNodeIndex As Integer = -1
            Dim blnNodeRemoved As Boolean
            Dim strNodeText As String
            Dim objRegEx As Regex

            If node.HasChildNodes = True Then

                'Iterate through all child nodes in this node.
                Do
                    'Increment child node index.
                    intNodeIndex += 1
                    blnNodeRemoved = False


                    'Is the node a text node?
                    If node.ChildNodes(intNodeIndex).Name = "#text" Then

                        'Put the child node's InnerHtml into a string variable.
                        strNodeText = node.ChildNodes(intNodeIndex).InnerHtml


                        'Iterate through array of keywords, wrapping the emphasis tag round each occurrence of the keywords.
                        For i As Integer = 0 To keywordsToEmphasise.Length - 1

                            Select Case Trim(LCase(keywordsToEmphasise(i)))
                                Case "", LCase(emphasisTag), "<", ">", "."
                                    'Don't emphasise these words.

                                Case Else
                                    'Replace the keyword using a case-insensitive match with the same keyword wrapped in the emphasis tag and preserve the case of the original word.
                                    objRegEx = New Regex("(" & Regex.Escape(keywordsToEmphasise(i)) & ")", RegexOptions.IgnoreCase)
                                    strNodeText = objRegEx.Replace(strNodeText, "<" & emphasisTag & ">$1</" & emphasisTag & ">")

                            End Select

                        Next


                        'Set the modified text as the InnerHtml of the child html node.
                        node.ChildNodes(intNodeIndex).InnerHtml = strNodeText

                    End If


                    'Process this child node's children (only if this node has not been removed).
                    If blnNodeRemoved = False Then
                        _EmphasiseKeywordsInHtmlRecursive(node.ChildNodes(intNodeIndex), keywordsToEmphasise, emphasisTag)

                    Else
                        'The node was removed, decrement the current node index.
                        intNodeIndex -= 1
                    End If


                    'Have we just processed the last child node?
                    If intNodeIndex = node.ChildNodes.Count - 1 Then
                        'We have finished processing all the child nodes.
                        Exit Do
                    End If

                Loop

            End If

        End Sub
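
A minimal usage sketch, assuming the two Shared methods above have been pasted into a class called KeywordHelper. That class name is hypothetical (the post does not show the enclosing class), and the project needs a reference to the Html Agility Pack. The sample HTML is the one from the original question.

```vbnet
Imports System

Module EmphasiseDemo
    Sub Main()
        ' Sample HTML from the original question, built up in three pieces.
        Dim strHtml As String = "Dogs are my favourite animals. "
        strHtml &= "<a href=""http://www.dogs.com"">Click here to see my dog</a>. "
        strHtml &= "Here is a photo of some dogs: <img src=""/images/dogs.jpg"" />."

        ' Words to emphasise. Matching is case-insensitive and keeps the original casing.
        Dim keywords() As String = {"dogs"}

        ' KeywordHelper is a hypothetical class holding the two Shared methods above.
        ' "strong" is the emphasis tag; True asks for XHTML-style output.
        Dim strResult As String = KeywordHelper.EmphasiseKeywordsInHtml(strHtml, keywords, "strong", True)

        Console.WriteLine(strResult)
    End Sub
End Module
```

Because only #text nodes are rewritten, the href and src values should come back untouched while "Dogs" and "dogs" in the visible text are wrapped in the emphasis tag.
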
