Strip HTML from Text JavaScript

Is there an easy way to take a string of html in JavaScript and strip out the html?

Solutions/Answers:

Solution 1:

If you’re running in a browser, then the easiest way is just to let the browser do it for you…

function strip(html)
{
   var tmp = document.createElement("DIV");
   tmp.innerHTML = html;
   return tmp.textContent || tmp.innerText || "";
}

Note: as folks have noted in the comments, this is best avoided if you don’t control the source of the HTML (for example, don’t run this on anything that could’ve come from user input). For those scenarios, you can still let the browser do the work for you – see Saba’s answer on using the now widely-available DOMParser.

Solution 2:

myString.replace(/<[^>]*>?/gm, '');
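
A quick sketch of what this one-liner does and where it falls short. The regex is the one above; the stripTags name and the sample strings are made up for illustration. Because it works on the text alone, nothing is parsed or executed, but it also does not decode entities, and a bare < in ordinary text is treated as the start of a tag.

```javascript
// Hypothetical wrapper around the regex from this answer.
function stripTags(s) {
  return s.replace(/<[^>]*>?/gm, '');
}

console.log(stripTags('<p>Hello <b>world</b></p>')); // "Hello world"
console.log(stripTags('Tom &amp; Jerry'));           // "Tom &amp; Jerry" (entity is not decoded)
console.log(stripTags('1 < 2 and 3 > 2'));           // "1  2" (text between < and > is lost)
```

So this is probably fine for quick cleanup of markup you trust, and less suitable when the text itself may contain angle brackets or entities.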

Solution 3:

Simplest way:

jQuery(html).text();

That retrieves all the text from a string of html.

Solution 4:

I would like to share an edited version of Shog9’s accepted answer.

As Mike Samuel pointed out in a comment, that function can execute inline JavaScript.
But Shog9 is right when saying “let the browser do it for you…”

So here is my edited version, using DOMParser:

function strip(html){
   var doc = new DOMParser().parseFromString(html, 'text/html');
   return doc.body.textContent || "";
}

Here is code to test the inline JavaScript case (with DOMParser, the handler does not run):

strip("<img onerror='alert(\"could run arbitrary JS here\")' src=bogus>")

Also, it does not request resources (such as images) on parse:

strip("Just text <img src='https://assets.rbl.ms/4155638/980x.jpg'>")
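
One thing to be aware of with this approach: textContent also returns the text inside script and style elements that end up in the body. Below is a minimal sketch that drops those elements before reading the text. It assumes a browser environment; the stripVisible name is made up here, and the demo call is guarded so the snippet does nothing outside a browser.

```javascript
// Hypothetical variant of strip(); needs a browser (DOMParser is not built into Node.js).
function stripVisible(html) {
  var doc = new DOMParser().parseFromString(html, 'text/html');
  // Remove elements whose text is code or CSS rather than readable content.
  doc.querySelectorAll('script, style').forEach(function (el) {
    el.remove();
  });
  return doc.body.textContent || "";
}

if (typeof DOMParser !== 'undefined') {
  // The plain strip() above would return "Hip { color: red; }" for this input.
  console.log(stripVisible('<p>Hi</p><style>p { color: red; }</style>')); // "Hi"
}
```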

Solution 5:

As an extension to the jQuery method: if your string might not contain HTML (e.g. if you are trying to remove HTML from a form field), then

jQuery(html).text();

will return an empty string if there is no HTML.

Use:

jQuery('<p>' + html + '</p>').text();

instead.

Update:
As has been pointed out in the comments, in some circumstances this solution will execute JavaScript contained within html. If the value of html could be influenced by an attacker, use a different solution.

Solution 6:

Converting HTML to plain text for emailing, keeping hyperlinks (a href) intact

The above function posted by hypoxide works fine, but I was after something that would convert HTML created in a web rich-text editor (for example FCKEditor), clearing out all the HTML but leaving the links. I wanted both the HTML and the plain-text version so that I could build the correct parts of an SMTP email (both HTML and plain text).

After a long time searching Google, my colleagues and I came up with this, using the regex engine in JavaScript:

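var str = 'this string has <i>html</i> code i want to <b>remove</b><br>Link Number 1 -><a href="http://www.bbc.co.uk">BBC</a> Link Number 1<br><p>Now back to normal text and stuff</p>';
// Turn line breaks and opening paragraph tags into newlines.
str = str.replace(/<br\s*\/?>/gi, "\n");
str = str.replace(/<p\b[^>]*>/gi, "\n");
// Keep each link as "text (Link->url)": $1 is the href, $2 is the link text.
str = str.replace(/<a\b[^>]*href="([^"]*)"[^>]*>(.*?)<\/a>/gi, " $2 (Link->$1) ");
// Strip every remaining tag.
str = str.replace(/<(?:.|\s)*?>/g, "");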

The str variable starts out like this:

this string has <i>html</i> code i want to <b>remove</b><br>Link Number 1 -><a href="http://www.bbc.co.uk">BBC</a> Link Number 1<br><p>Now back to normal text and stuff</p>

After the code has run, it looks like this:

this string has html code i want to remove
Link Number 1 -> BBC (Link->http://www.bbc.co.uk)  Link Number 1


Now back to normal text and stuff

As you can see, all the HTML has been removed and the link has been preserved, with the hyperlinked text still intact. I have also replaced the <p> and <br> tags with \n (the newline character) so that some visual formatting is retained.

To change the link format (e.g. BBC (Link->http://www.bbc.co.uk)), just edit the " $2 (Link->$1) " replacement string, where $1 is the href URL/URI and $2 is the hyperlinked text. With the links directly in the body of the plain text, most mail clients convert them so that the user can click on them.
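
To make the point about the link format concrete, here is a rough sketch that wraps the same steps in a function and takes the replacement string as an argument. The htmlToPlainText name and the linkFormat parameter are made up for illustration, and the patterns use [^>]* instead of .* so that a match cannot run past the end of its tag; treat it as one possible variation, not a drop-in replacement.

```javascript
// Hypothetical helper: same idea as the snippet in this answer, with a configurable link format.
// In linkFormat, $1 is the href and $2 is the hyperlinked text.
function htmlToPlainText(html, linkFormat = " $2 (Link->$1) ") {
  return html
    .replace(/<br\s*\/?>/gi, "\n")   // line breaks become newlines
    .replace(/<p\b[^>]*>/gi, "\n")   // opening paragraph tags become newlines
    .replace(/<a\b[^>]*href="([^"]*)"[^>]*>(.*?)<\/a>/gi, linkFormat)
    .replace(/<[^>]*>/g, "");        // strip every remaining tag
}

const html = 'See <a href="http://www.bbc.co.uk">BBC</a> for <b>news</b>.';
console.log(htmlToPlainText(html));            // "See  BBC (Link->http://www.bbc.co.uk)  for news."
console.log(htmlToPlainText(html, "$2 [$1]")); // "See BBC [http://www.bbc.co.uk] for news."
```

One thing to watch: a format that wraps the URL in angle brackets, such as "$2 <$1>", would be removed again by the final tag-stripping step, so pick delimiters other than < and >.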

Hope you find this useful.