DEV Community

Bruce Axtens

Retrieving OuterHTML without InnerHTML in C#

Recently I asked "How do I retrieve OuterHtml without the InnerHtml?" and this is the solution I came up with. It's C#, but it should translate okay to most other languages.

I'm doing this because the project I'm working on involves checking third-party websites for back-links to our clients' websites. The information in the back-link may be wrong enough for us to disavow it.

The information I generate from this and other code in the project goes into an XML file and eventually into SQL Server. The HTML that contains the various identifying strings needs to be kept to a minimum. Why take the whole <table> if the identifier falls in the src of an <img> in the 52nd <tr>, 7th <td>?

Here's the code:

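// HtmlNode and HtmlNodeType are assumed to come from the HtmlAgilityPack package.
using System.Linq;
using HtmlAgilityPack;

// Note: the child nodes are removed from root itself, so the parsed document is modified.
private string OuterMinusInner(HtmlNode root)
{
    if (root == null)
        return string.Empty;

    // ToList() takes a snapshot, so RemoveChild() can't disturb the enumeration.
    foreach (var nodeFromList in
        (from node
         in root.ChildNodes
         where node.NodeType != HtmlNodeType.Text
         select node).ToList())
    {
        root.RemoveChild(nodeFromList);
    }

    return root.OuterHtml;
}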

The method signature defines a single parameter root as an HtmlNode. The method will return a string.

Next, the method tests for root being null and if it is, the method returns an empty string to the caller.

Next comes some LINQ code. I'm fairly new to LINQ. I've known about it for years, but only really got into it after working through some of the tasks on the C# track at Exercism.

The LINQ query from node in root.ChildNodes where node.NodeType != HtmlNodeType.Text select node gets all of the child nodes in root where the HtmlNodeType is anything other than Text (viz. Element, Document or Comment).

The results of the query are committed to a List (of HtmlNode) using .ToList(). This is important. If you don't do this, the code will fail at run-time (typically with an InvalidOperationException), because each subsequent .RemoveChild() changes the child-node collection of root while the LINQ query is still lazily enumerating it.
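To see why the snapshot matters, here is a minimal sketch of the same pattern using a plain List<int> instead of an HtmlNode's children. The class and method names (SnapshotDemo, RemoveWhileEnumeratingThrows, RemoveFromSnapshot) are made up for illustration; only the standard library is used.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SnapshotDemo
{
    // Removes items from the list while a lazy LINQ query is still
    // enumerating it. Returns true if the runtime rejected the mutation.
    public static bool RemoveWhileEnumeratingThrows()
    {
        var numbers = new List<int> { 1, 2, 3, 4 };
        try
        {
            foreach (var n in numbers.Where(x => x % 2 == 0))
                numbers.Remove(n); // mutates the list being enumerated
            return false;
        }
        catch (InvalidOperationException)
        {
            return true;
        }
    }

    // Same loop, but ToList() copies the matches first, so removing
    // from the original list no longer affects what is being enumerated.
    public static List<int> RemoveFromSnapshot()
    {
        var numbers = new List<int> { 1, 2, 3, 4 };
        foreach (var n in numbers.Where(x => x % 2 == 0).ToList())
            numbers.Remove(n);
        return numbers;
    }

    public static void Main()
    {
        Console.WriteLine(RemoveWhileEnumeratingThrows());          // True
        Console.WriteLine(string.Join(",", RemoveFromSnapshot()));  // 1,3
    }
}
```

The first loop fails on the step after the first removal, which is the same situation the .ToList() call in OuterMinusInner avoids.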

The foreach takes each element of the List of HtmlNode returned from the .ToList of the query and puts it into nodeFromList, using that value as the node to remove from root (in root.RemoveChild(nodeFromList)).

When all the non-Text nodes are removed from root, the method ends, returning the OuterHtml of root.
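Because the method removes children from the node it is given, the parsed document is changed as a side effect. If the document is needed again afterwards, one option is to do the same work on a deep copy. The following is only a sketch under some assumptions: it presumes the HtmlAgilityPack NuGet package is installed, and the names OuterMinusInnerDemo and OuterMinusInnerCopy, as well as the sample HTML, are invented for this example rather than taken from the project.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // NuGet package, assumed to be installed

public static class OuterMinusInnerDemo
{
    // Hypothetical variant of OuterMinusInner that works on a deep clone,
    // so the caller's node keeps its child elements.
    public static string OuterMinusInnerCopy(HtmlNode root)
    {
        if (root == null)
            return string.Empty;

        HtmlNode copy = root.CloneNode(true); // deep clone, children included

        // Snapshot the non-text children before removing them from the copy.
        foreach (var child in copy.ChildNodes
                     .Where(n => n.NodeType != HtmlNodeType.Text)
                     .ToList())
        {
            copy.RemoveChild(child);
        }

        return copy.OuterHtml;
    }

    public static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<div id=\"wrap\">before<img src=\"a.png\">after</div>");

        HtmlNode div = doc.DocumentNode.SelectSingleNode("//div[@id='wrap']");

        // Prints the <div> tag pair with only its text nodes inside.
        Console.WriteLine(OuterMinusInnerCopy(div));

        // The original node still contains its <img> child.
        Console.WriteLine(div.InnerHtml);
    }
}
```

The trade-off is an extra copy of the subtree per call, which may matter for very large nodes such as the <table> case described earlier.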

Example:
This

<ul class="menu medium-horizontal vertical accordion-menu" id="menu-header-1" role="menu" aria-multiselectable="true" data-responsive-menu="accordion medium-dropdown" data-close-on-click-inside="false" data-accordion-menu="ljy0ut-accordion-menu"><li class="menu-item menu-item-type-post_type menu-item-object-page menu-item-home menu-item-9905" role="menuitem"></li>
<li class="menu-item menu-item-type-post_type menu-item-object-page menu-item-9910" role="menuitem"></li>
<li class="menu-item menu-item-type-post_type menu-item-object-page menu-item-18" role="menuitem"></li>
<li class="menu-item menu-item-type-post_type menu-item-object-page menu-item-6785" role="menuitem"></li>
<li class="menu-item menu-item-type-post_type menu-item-object-page menu-item-7202" role="menuitem"></li>
<li class="menu-item menu-item-type-post_type menu-item-object-page current_page_parent menu-item-11332" role="menuitem"></li>
<li class="menu-item menu-item-type-post_type menu-item-object-page menu-item-10938" role="menuitem"></li>
<li class="menu-item menu-item-type-post_type menu-item-object-page menu-item-31" role="menuitem"></li>
</ul>

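becomes this (the blank lines are the whitespace-only text nodes that sat between the <li> elements; being Text nodes, they are kept):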

<ul class="menu medium-horizontal vertical accordion-menu" id="menu-header-1" role="menu" aria-multiselectable="true" data-responsive-menu="accordion medium-dropdown" data-close-on-click-inside="false" data-accordion-menu="ljy0ut-accordion-menu">







</ul>
