Recently I asked "How do I retrieve OuterHtml without the InnerHtml?" and this is the solution I came up with. It's C#, but it should translate reasonably well to most other languages.
I'm doing this because the project I'm working on involves checking third-party websites for back-links to our clients' websites. The information in the back-link may be wrong enough for us to disavow it.
The information I generate from this and other code in the project goes into an XML file and eventually into SQL Server. The HTML that contains the various identifying strings needs to be kept to a minimum. Why take the whole <table> if the identifier falls in the src of an <img> in the 52nd <tr>, 7th <td>?
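Here's the code:
// Requires: using System.Linq; using HtmlAgilityPack;
private string OuterMinusInner(HtmlNode root)
{
    if (root == null)
        return string.Empty;

    // Snapshot the non-text children first, then remove them from root.
    foreach (var nodeFromList in
        (from node
         in root.ChildNodes
         where node.NodeType != HtmlNodeType.Text
         select node).ToList())
    {
        root.RemoveChild(nodeFromList);
    }

    // root now holds only its own tag, attributes and any direct text.
    return root.OuterHtml;
}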
The method signature defines a single parameter, root, as an HtmlNode. The method will return a string.
Next, the method tests for root being null and, if it is, the method returns an empty string to the caller.
Next comes some Linq code. I'm fairly new to Linq. I've known about it for years, but only really got into it after working through some of the tasks on the C# track at Exercism.
The Linq query from node in root.ChildNodes where node.NodeType != HtmlNodeType.Text select node gets all of the child nodes in root where the HtmlNodeType is anything other than Text (viz. Element, Document or Comment).
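For readers more used to method syntax, the same filter can be written as a Where call. This is only a sketch: it assumes the HtmlAgilityPack NuGet package is referenced, and the class name, the two helper methods and the sample HTML are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

public static class NonTextChildrenDemo
{
    // Query syntax, as used in OuterMinusInner.
    public static List<HtmlNode> WithQuerySyntax(HtmlNode root) =>
        (from node in root.ChildNodes
         where node.NodeType != HtmlNodeType.Text
         select node).ToList();

    // Method syntax; the compiler turns the query above into this Where call.
    public static List<HtmlNode> WithMethodSyntax(HtmlNode root) =>
        root.ChildNodes.Where(node => node.NodeType != HtmlNodeType.Text).ToList();

    public static void Main()
    {
        // Hypothetical sample: one text child, one comment child, one element child.
        var doc = new HtmlDocument();
        doc.LoadHtml("<div>text<!-- note --><span>child</span></div>");
        HtmlNode div = doc.DocumentNode.SelectSingleNode("//div");

        // Both helpers should list the same nodes; the text child is skipped.
        foreach (HtmlNode node in WithMethodSyntax(div))
            Console.WriteLine($"{node.NodeType}: {node.Name}");
    }
}
```

Which form to use is a matter of taste; both end in .ToList() for the reason given next.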
The results of the query are committed to a List (of HtmlNode) using .ToList(). This is important. If you don't do this, the code will crash at run-time because the subsequent .RemoveChild() will change the number of child nodes of root, nodes that the Linq code is (otherwise) enumerating on the fly.
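The same failure can be reproduced without any HTML at all. The sketch below uses a plain List of strings as a stand-in for root.ChildNodes (the names are invented for the demo): removing items while a lazy query is still reading the list throws, while taking a snapshot with .ToList() first does not.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SnapshotDemo
{
    public static void Main()
    {
        // Stand-in for root.ChildNodes: two "element" entries and two "text" entries.
        var children = new List<string> { "li", "#text", "li", "#text" };

        try
        {
            // Lazy query: it reads from 'children' while the loop body removes from it.
            foreach (var name in children.Where(n => n != "#text"))
                children.Remove(name);
        }
        catch (InvalidOperationException ex)
        {
            Console.WriteLine($"Live enumeration failed: {ex.Message}");
        }

        // Reset, then take a snapshot with ToList() before removing anything.
        children = new List<string> { "li", "#text", "li", "#text" };
        foreach (var name in children.Where(n => n != "#text").ToList())
            children.Remove(name);

        Console.WriteLine(string.Join(", ", children)); // prints "#text, #text"
    }
}
```

The snapshot costs one extra list allocation, which is negligible next to parsing the page in the first place.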
The foreach takes each element of the List of HtmlNode returned from the .ToList() of the query and puts it into nodeFromList, using that value as the node to remove from root (in root.RemoveChild(nodeFromList)).
When all the non-Text nodes have been removed from root, the method ends, returning the OuterHtml of root.
Example:
This
<ul class="menu medium-horizontal vertical accordion-menu" id="menu-header-1" role="menu" aria-multiselectable="true" data-responsive-menu="accordion medium-dropdown" data-close-on-click-inside="false" data-accordion-menu="ljy0ut-accordion-menu"><li class="menu-item menu-item-type-post_type menu-item-object-page menu-item-home menu-item-9905" role="menuitem"></li>
<li class="menu-item menu-item-type-post_type menu-item-object-page menu-item-9910" role="menuitem"></li>
<li class="menu-item menu-item-type-post_type menu-item-object-page menu-item-18" role="menuitem"></li>
<li class="menu-item menu-item-type-post_type menu-item-object-page menu-item-6785" role="menuitem"></li>
<li class="menu-item menu-item-type-post_type menu-item-object-page menu-item-7202" role="menuitem"></li>
<li class="menu-item menu-item-type-post_type menu-item-object-page current_page_parent menu-item-11332" role="menuitem"></li>
<li class="menu-item menu-item-type-post_type menu-item-object-page menu-item-10938" role="menuitem"></li>
<li class="menu-item menu-item-type-post_type menu-item-object-page menu-item-31" role="menuitem"></li>
</ul>
becomes this
<ul class="menu medium-horizontal vertical accordion-menu" id="menu-header-1" role="menu" aria-multiselectable="true" data-responsive-menu="accordion medium-dropdown" data-close-on-click-inside="false" data-accordion-menu="ljy0ut-accordion-menu">
</ul>
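To try this end to end, something like the following should work. It is a minimal sketch, assuming the HtmlAgilityPack NuGet package is referenced; the demo class, the sample HTML and the XPath are invented for illustration, and the method is made static only so that Main can call it.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

public static class OuterMinusInnerDemo
{
    // Same logic as the method in the post, made static for this demo.
    private static string OuterMinusInner(HtmlNode root)
    {
        if (root == null)
            return string.Empty;

        foreach (var nodeFromList in
            (from node in root.ChildNodes
             where node.NodeType != HtmlNodeType.Text
             select node).ToList())
        {
            root.RemoveChild(nodeFromList);
        }
        return root.OuterHtml;
    }

    public static void Main()
    {
        // Hypothetical sample markup: a small menu with two links.
        var doc = new HtmlDocument();
        doc.LoadHtml("<ul id=\"menu\"><li><a href=\"/\">Home</a></li><li><a href=\"/about\">About</a></li></ul>");

        HtmlNode ul = doc.DocumentNode.SelectSingleNode("//ul[@id='menu']");

        // Prints the <ul> tag with its attributes but without the <li> children.
        Console.WriteLine(OuterMinusInner(ul));

        // The node itself was edited, so the children are gone from doc as well.
        Console.WriteLine(ul.ChildNodes.Count);
    }
}
```

One thing to keep in mind is that the method edits the node it is given. If the document is still needed afterwards, one option (not something the original code does) is to pass a deep copy, for example root.CloneNode(true), and strip that instead.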