Extract hostname name from string

Extract hostname name from string

I would like to match just the root of a URL and not the whole URL from a text string. Given:


http://www.example.com/12xy45
http://example.com/random

I want to get the 2 last instances resolving to the www.example.com or example.com domain.
I heard regex is slow and this would be my second regex expression on the page so If there is anyway to do it without regex let me know.
I’m seeking a JS/jQuery version of this solution.

Solutions/Answers:

Solution 1:

I recommend using the npm package psl (Public Suffix List). The “Public Suffix List” is a list of all valid domain suffixes and rules, not just Country Code Top-Level domains, but unicode characters as well that would be considered the root domain (i.e. www.食狮.公司.cn, b.c.kobe.jp, etc.). Read more about it here.

Try:

npm install --save psl

Then with my “extractHostname” implementation run:

let psl = require('psl');
let url = 'http://www.youtube.com/watch?v=ClkQA2Lb_iE';
psl.get(extractHostname(url)); // returns youtube.com

I can’t use an npm package, so below only tests extractHostname.

function extractHostname(url) {
    var hostname;
    //find & remove protocol (http, ftp, etc.) and get hostname

    if (url.indexOf("//") > -1) {
        hostname = url.split('/')[2];
    }
    else {
        hostname = url.split('/')[0];
    }

    //find & remove port number
    hostname = hostname.split(':')[0];
    //find & remove "?"
    hostname = hostname.split('?')[0];

    return hostname;
}

//test the code
console.log("== Testing extractHostname: ==");
console.log(extractHostname("http://www.blog.classroom.me.uk/index.php"));
console.log(extractHostname("http://www.youtube.com/watch?v=ClkQA2Lb_iE"));
console.log(extractHostname("https://www.youtube.com/watch?v=ClkQA2Lb_iE"));
console.log(extractHostname("www.youtube.com/watch?v=ClkQA2Lb_iE"));
console.log(extractHostname("ftps://ftp.websitename.com/dir/file.txt"));
console.log(extractHostname("websitename.com:1234/dir/file.txt"));
console.log(extractHostname("ftps://websitename.com:1234/dir/file.txt"));
console.log(extractHostname("example.com?param=value"));
console.log(extractHostname("https://facebook.github.io/jest/"));
console.log(extractHostname("//youtube.com/watch?v=ClkQA2Lb_iE"));
console.log(extractHostname("http://localhost:4200/watch?v=ClkQA2Lb_iE"));

Regardless having the protocol or even port number, you can extract the domain. This is a very simplified, non-regex solution, so I think this will do.

Related:  requireJS optimizer does not include nested require calls

*Thank you @Timmerz, @renoirb, @rineez, @BigDong, @ra00l, @ILikeBeansTacos, @CharlesRobertson for your suggestions! @ross-allen, thank you for reporting the bug!

Solution 2:

A neat trick without using regular expressions:

var tmp        = document.createElement ('a');
;   tmp.href   = "http://www.example.com/12xy45";

// tmp.hostname will now contain 'www.example.com'
// tmp.host will now contain hostname and port 'www.example.com:80'

Wrap the above in a function such as the below and you have yourself a superb way of snatching the domain part out of an URI.

function url_domain(data) {
  var    a      = document.createElement('a');
         a.href = data;
  return a.hostname;
}

Solution 3:

Try this:

var matches = url.match(/^https?\:\/\/([^\/?#]+)(?:[\/?#]|$)/i);
var domain = matches && matches[1];  // domain will be null if no match is found

If you want to exclude the port from your result, use this expression instead:

/^https?\:\/\/([^\/:?#]+)(?:[\/:?#]|$)/i

Edit: To prevent specific domains from matching, use a negative lookahead. (?!youtube.com)

/^https?\:\/\/(?!(?:www\.)?(?:youtube\.com|youtu\.be))([^\/:?#]+)(?:[\/:?#]|$)/i

Solution 4:

There is no need to parse the string, just pass your URL as an argument to URL constructor:

var url = 'http://www.youtube.com/watch?v=ClkQA2Lb_iE';
var hostname = (new URL(url)).hostname;

assert(hostname === 'www.youtube.com');

Solution 5:

Parsing a URL can be tricky because you can have port numbers and special chars. As such, I recommend using something like parseUri to do this for you. I doubt performance is going to be a issue unless you are parsing hundreds of URLs.

Related:  307 Redirect when loading analytics.js in Chrome

Solution 6:

I tried to use the Given solutions, the Chosen one was an overkill for my purpose and “Creating a element” one messes up for me.

It’s not ready for Port in URL yet. I hope someone finds it useful

function parseURL(url){
    parsed_url = {}

    if ( url == null || url.length == 0 )
        return parsed_url;

    protocol_i = url.indexOf('://');
    parsed_url.protocol = url.substr(0,protocol_i);

    remaining_url = url.substr(protocol_i + 3, url.length);
    domain_i = remaining_url.indexOf('/');
    domain_i = domain_i == -1 ? remaining_url.length - 1 : domain_i;
    parsed_url.domain = remaining_url.substr(0, domain_i);
    parsed_url.path = domain_i == -1 || domain_i + 1 == remaining_url.length ? null : remaining_url.substr(domain_i + 1, remaining_url.length);

    domain_parts = parsed_url.domain.split('.');
    switch ( domain_parts.length ){
        case 2:
          parsed_url.subdomain = null;
          parsed_url.host = domain_parts[0];
          parsed_url.tld = domain_parts[1];
          break;
        case 3:
          parsed_url.subdomain = domain_parts[0];
          parsed_url.host = domain_parts[1];
          parsed_url.tld = domain_parts[2];
          break;
        case 4:
          parsed_url.subdomain = domain_parts[0];
          parsed_url.host = domain_parts[1];
          parsed_url.tld = domain_parts[2] + '.' + domain_parts[3];
          break;
    }

    parsed_url.parent_domain = parsed_url.host + '.' + parsed_url.tld;

    return parsed_url;
}

Running this:

parseURL('https://www.facebook.com/100003379429021_356001651189146');

Result:

Object {
    domain : "www.facebook.com",
    host : "facebook",
    path : "100003379429021_356001651189146",
    protocol : "https",
    subdomain : "www",
    tld : "com"
}