Using the browser’s parser is probably the best bet in current browsers. The following will work, with these caveats:
- Your HTML is valid within a `<div>` element. HTML contained within `<head>` tags is not valid within a `<div>` and may therefore not be parsed correctly.
- The `textContent` (the DOM standard property) and `innerText` (non-standard) properties are not identical. For example, `textContent` will include text within a `<script>` element while `innerText` will not (in most browsers). This only affects IE <=8, which is the only major browser not to support `textContent`.
- The HTML does not contain `<script>` elements.
- The HTML is not
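- The HTML comes from a trusted source. Assigning untrusted HTML to `innerHTML` can run arbitrary JavaScript, for example: `<img onerror='alert("could run arbitrary JS here")' src=bogus>`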

    var html = "<p>Some HTML</p>";
    var div = document.createElement("div");
    div.innerHTML = html;
    var text = div.textContent || div.innerText || "";

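For what it's worth, the snippet above could be wrapped in a small helper that handles a missing input. This is only a sketch: `htmlToText` and its optional `doc` parameter are names made up here, not part of any library, and the `doc` argument exists only so the function can be tried outside a browser. It does not make untrusted HTML safe; the trusted-source caveat still applies.

```javascript
// Hypothetical helper wrapping the div.innerHTML approach.
// `doc` is an assumed, optional parameter: pass a document-like object,
// or omit it to use the browser's global `document`.
function htmlToText(html, doc) {
  // Guard against null/undefined input instead of parsing the string "null".
  if (html === null || html === undefined) {
    return "";
  }
  var owner = doc || document;
  var div = owner.createElement("div");
  div.innerHTML = String(html);
  // textContent is the standard property; innerText is the old IE fallback.
  return div.textContent || div.innerText || "";
}
```

In a browser you would just call `htmlToText(someTrustedHtml)`.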
cleanText = strInputCode.replace(/<\/?[^>]+(>|$)/g, "");
Distilled from this website (web.archive).
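As a rough illustration of what that regex does and does not handle, here it is wrapped in a function. The name `stripTags` is invented for this example. Note that the regex has no notion of HTML structure: it does not decode entities such as `&amp;`, and a bare `<` in ordinary text can swallow everything up to the next `>`.

```javascript
// Hypothetical wrapper around the regex from the answer above.
function stripTags(input) {
  // Removes anything that looks like an opening or closing tag,
  // including an unterminated tag at the end of the string.
  return String(input).replace(/<\/?[^>]+(>|$)/g, "");
}

console.log(stripTags("<p>Hello, <b>World</b></p>")); // Hello, World
console.log(stripTags("Tom &amp; Jerry"));            // Tom &amp; Jerry (entity left as-is)
console.log(stripTags("1 < 2 and 3 > 2"));            // text between < and > is lost
```

So it works without a DOM, but only for markup you already know is simple.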

    var html = "<p>Hello, <b>World</b>";
    var div = document.createElement("div");
    div.innerHTML = html;
    alert(div.innerText); // Hello, World

That's pretty much the best way of doing it; you’re letting the browser do what it does best: parse HTML.
Edit: As noted in the comments below, this is not the most cross-browser solution. The most cross-browser solution would be to recursively go through all the children of the element and concatenate all text nodes that you find. However, if you’re using jQuery, it already does it for you:
Check out the text method.
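If you're not using jQuery, the recursive approach mentioned above might look roughly like this. It's a minimal sketch, and `collectText` is a name chosen here for illustration. It relies only on `nodeType`, `nodeValue` and `childNodes`, so it accepts any DOM node (or, for trying it out, a plain object shaped like one).

```javascript
// Hypothetical helper: concatenates the values of all text nodes under `node`.
function collectText(node) {
  // nodeType 3 is a text node (Node.TEXT_NODE).
  if (node.nodeType === 3) {
    return node.nodeValue;
  }
  var text = "";
  var children = node.childNodes || [];
  for (var i = 0; i < children.length; i++) {
    text += collectText(children[i]);
  }
  return text;
}

// Stand-in tree equivalent to <p>Hello, <b>World</b></p>
var tree = {
  nodeType: 1,
  childNodes: [
    { nodeType: 3, nodeValue: "Hello, " },
    { nodeType: 1, childNodes: [{ nodeType: 3, nodeValue: "World" }] }
  ]
};
console.log(collectText(tree)); // Hello, World
```

In a browser you would pass a real element, for example the `div` whose `innerHTML` you set earlier.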
I know this question has an accepted answer, but I feel that it doesn’t work in all cases.
<script /> tag above.