A Regular Expression to match any URL

Update: see my most recent comment. WordPress has a pretty good regex for matching URLs.

A bold claim, but I think I’ve got one:

|([A-Za-z]{3,9})://([-;:&=\+\$,\w]+@{1})?([-A-Za-z0-9\.]+)+:?(\d+)?((/[-\+~%/\.\w]+)?\??([-\+=&;%@\.\w]+)?#?([\w]+)?)?|

An online events booking system I developed doesn’t allow HTML in the event description field, primarily to protect against annoying scripting attacks. But what if you want to provide a link in the description? I need to detect plain text URLs stored in the database, and turn them into hyperlinks when displayed in the browser. The regular expression above allows me to do that quite easily in PHP:

$pattern = '|([A-Za-z]{3,9})://([-;:&=\+\$,\w]+@{1})?([-A-Za-z0-9\.]+)+:?(\d+)?((/[-\+~%/\.\w]+)?\??([-\+=&;%@\.\w]+)?#?([\w]+)?)?|';
$html = preg_replace($pattern, '<a href="$0">$0</a>', $text);

But the regular expression has several submatches. They provide a means to break down the URL into its constituent parts, including protocol, user info, server name, REQUEST_URI, query string and anchor.
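
To illustrate the submatches, here is a minimal sketch that runs preg_match against a single URL. The sample URL and the array key names are my own choices for illustration; the group numbers simply follow the order of the parentheses in the pattern.

```php
<?php
// The pattern from this post, with "|" as the delimiter.
$pattern = '|([A-Za-z]{3,9})://([-;:&=\+\$,\w]+@{1})?([-A-Za-z0-9\.]+)+:?(\d+)?((/[-\+~%/\.\w]+)?\??([-\+=&;%@\.\w]+)?#?([\w]+)?)?|';

// Hypothetical sample URL, chosen so that every part is present.
$url = 'http://user@www.example.com:8080/cgi-bin/script.cgi?variable=value#section1';

$parts = [];
if (preg_match($pattern, $url, $m)) {
    // Trailing groups that did not take part in the match are absent
    // from $m, so the optional parts fall back to an empty string.
    $parts = [
        'protocol' => $m[1],
        'userinfo' => $m[2] ?? '',   // keeps its trailing "@"
        'host'     => $m[3],
        'port'     => $m[4] ?? '',
        'path'     => $m[6] ?? '',
        'query'    => $m[7] ?? '',
        'anchor'   => $m[8] ?? '',
    ];
}

print_r($parts);
```

Group 5 is the wrapper around the path, query string and anchor, which is why it is skipped in the array.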

Here’s a PHP class I wrote that uses this Regular Expression to analyse a string, detect URLs, populate an array with the constituent parts of the URL, and replace URLs with hyperlinks. Here’s an example of usage:

$text  = 'Please visit http://www.example.com/cgi-bin/';
$text .= 'script.cgi?variable=value&variable2=some';
$text .= '+url+encoded+text#section-1 to find out more';

$urlf = new URLFinder();
$html = $urlf->make_links($text);
echo $html;

If you print_r($urlf), you can see how the URL is broken down.

I haven’t managed to find any exceptions to the expression, but if you do, please post an example.

April 23rd, 2008|Tools & Technologies|21 Comments

About the Author: Chris Fryer

Goal: carrier-grade reliability. Nines. Lots of nines.

21 Comments

huymq85 May 14, 2008 at 3:50 am

Your regular expression cannot match this URL:
“http://i88.photobucket.com/albums/k163/thien0211photo/hinh chuyen de up/KienG20cau20Xeobuom20copy.jpg”
Chris Fryer May 14, 2008 at 10:05 am

Thanks. But strictly speaking, the spaces in that URL should be encoded as %20.

http://www.faqs.org/rfcs/rfc1738.html

The application that uses this regular expression has to ignore whitespace, as it’s intended to detect URLs within longer strings that contain other text and convert them to links.

It looks like Photobucket correctly encodes URLs in their HTML, even if the %20 characters don’t show up in the address bar.
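
If your own application generates the URLs, the spaces can be percent-encoded before the link is written out. This is a minimal sketch, assuming the path is known separately from the host. encode_path is a hypothetical helper name; rawurlencode is PHP’s built-in function and encodes a space as %20.

```php
<?php
// Percent-encode each path segment on its own so that the "/"
// separators between segments are left intact.
function encode_path(string $path): string
{
    return implode('/', array_map('rawurlencode', explode('/', $path)));
}

$path = '/albums/k163/thien0211photo/hinh chuyen de up/KienG20cau20Xeobuom20copy.jpg';
$url  = 'http://i88.photobucket.com' . encode_path($path);

echo $url, "\n";
```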
Pavel Perna May 30, 2008 at 4:16 pm

The Regular Expression does not match these, for example:

1. link inside of htmlspecialchars : &gt;http://www.google.com&lt;
2. this link : http://localhost:32000/?ch=1
Tobi August 19, 2008 at 7:30 pm

Great! This saved me a nerve-racking evening (I’m a regular expression loser) 🙂
Howie September 3, 2008 at 7:00 pm

I’ve updated it to be a bit more restrictive: it requires domains to end in a .xx suffix and disallows consecutive dots in the domain name. I’ve packaged the : with the port, and the ? or & (some sites don’t use ?, for some reason) with the parameter data. I’ve also accounted for the problem Pavel mentioned above. Here’s my regex:

(([A-Za-z]{3,9})://)?([-;:&=\+\$,\w]+@{1})?(([-A-Za-z0-9]+\.)+[A-Za-z]{2,3})(:\d+)?((/[-\+~%/\.\w]+)?/?([&?][-\+=&;%@\.\w]+)?(#[\w]+)?)?
Chris Fryer September 4, 2008 at 1:02 pm

Thanks, Howie. I’d suggest the limiting repetition modifier for top-level domains should be {2,6} to allow for things like .museum.

See the list of currently-implemented TLDs at the IANA:

http://data.iana.org/TLD/tlds-alpha-by-domain.txt
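
The effect of the repetition limit is easy to check. The sketch below runs Howie’s pattern twice, once with {2,3} and once with {2,6}. The test URL http://example.museum is made up for illustration.

```php
<?php
// Howie's pattern as posted, with "|" as the delimiter.
$narrow = '|(([A-Za-z]{3,9})://)?([-;:&=\+\$,\w]+@{1})?(([-A-Za-z0-9]+\.)+[A-Za-z]{2,3})(:\d+)?((/[-\+~%/\.\w]+)?/?([&?][-\+=&;%@\.\w]+)?(#[\w]+)?)?|';

// The same pattern with the TLD length limit widened to six letters.
$wide = str_replace('{2,3}', '{2,6}', $narrow);

$url = 'http://example.museum';   // hypothetical test URL

preg_match($narrow, $url, $short);
preg_match($wide, $url, $long);

echo $short[0], "\n";   // the match stops after three letters of the TLD
echo $long[0], "\n";    // the whole URL is matched
```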
MasterWaster September 18, 2008 at 10:11 pm

Problems with your code and links like http://google.de/?aaa

http://google.de//?aaa
works mmh…
Chris Fryer September 19, 2008 at 2:40 pm

This is one of the problems Pavel found. In your example URL, the default script name is implicit, so can be omitted before the query string.

The problematic portion is

(/[-\+~%/\.\w]+)?

Adding a second / results in a match, as you’ve found. So this portion of the regex should be:

(/[-\+~%/\.\w]*)?

This will match the trailing slash after the FQDN, and optionally match any valid filename characters that follow.

The complete subpattern is optional, so things like http://www.google.de will still be matched despite the lack of a trailing slash.
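
Here is a quick sketch comparing the original pattern with a relaxed one. It assumes the quantifier on the character class is * (zero or more), so that a bare trailing slash matches and longer paths still match in full. The variable names are mine.

```php
<?php
// The original pattern: the path needs at least one character after "/".
$original = '|([A-Za-z]{3,9})://([-;:&=\+\$,\w]+@{1})?([-A-Za-z0-9\.]+)+:?(\d+)?((/[-\+~%/\.\w]+)?\??([-\+=&;%@\.\w]+)?#?([\w]+)?)?|';

// Relaxed variant (assumption): zero or more path characters after "/".
$relaxed  = '|([A-Za-z]{3,9})://([-;:&=\+\$,\w]+@{1})?([-A-Za-z0-9\.]+)+:?(\d+)?((/[-\+~%/\.\w]*)?\??([-\+=&;%@\.\w]+)?#?([\w]+)?)?|';

$results = [];
foreach (['http://google.de/?aaa', 'http://www.google.de'] as $url) {
    preg_match($original, $url, $before);
    preg_match($relaxed, $url, $after);
    $results[$url] = ['original' => $before[0], 'relaxed' => $after[0]];
}

print_r($results);
```

With the original pattern the match for http://google.de/?aaa stops at the host name; the relaxed pattern picks up the slash and the query string as well.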
David Berlind October 4, 2008 at 5:44 am

I’ve tried all of these patterns to handle URLs from Flickr. For example,

http://www.flickr.com/photos/27703312@N04/

doing a simple replacement of the URL with the text “url” results in:

url@N04 or url@N04/

Looks like these strings don’t take the “@” into account.

David
Unreal Media January 30, 2009 at 11:15 am

Great, Thanks 🙂
Nibalkar August 22, 2011 at 10:20 am

Is this worth using?

“(http|www)\S+(\s+)?\S+”
Marc November 13, 2011 at 10:02 am

It doesn’t compile on http://www.regexplanet.com/simple/index.html. Also, it requires the http part, whereas many people omit that when they input a URL.
Lucian (Low Cost Routes) November 18, 2011 at 5:06 pm

This is a thing that always surprises me: why should the regular expressions for URL and email matching be so long? I know that there are a lot of rules… but I prefer the shortest version (like Nibalkar’s) and risk losing the others, instead of having to encode all the rules and the exceptions to the rules.
Matthew O'Riordan November 23, 2011 at 2:10 am

I’ve been trawling the web for a solution and could not find one, so I’ve written my own, which matches any URL in all the usual formats, such as http://www.google.com, as well as emails such as matt@test.com.

See my blog post about the regular expression including a test case at http://blog.mattheworiordan.com/post/13174566389/url-regular-expression-for-links-with-or-without-the
Christopher Fryer November 29, 2011 at 3:23 pm

… and of course WordPress itself contains code that identifies URIs:

http://core.trac.wordpress.org/browser/trunk/wp-includes/formatting.php#L1413
Bob Ray May 27, 2012 at 5:01 am

It’s a nice regex, but it fails (like every one I’ve found) with URLs in text (which is what I need) when they end with punctuation, for example: Have you visited http://bobsguides.com? or I like http://bobsguides.com.

The terminal punctuation is considered part of the URL.
Chris Fryer May 29, 2012 at 2:00 pm

But WordPress correctly handles the terminal punctuation, as you can see. A question mark and a dot are both valid parts of a URL, except where they appear on their own.

You can find that regex here:

http://core.trac.wordpress.org/browser/trunk/wp-includes/formatting.php#L1569

Someday I’ll update this post to reflect all the suggestions people have made. We get a lot of traffic from Google on this post, and it’d be good to give them something they can use. In the meantime, try the WordPress regex.
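
Terminal punctuation can also be handled outside the regex. What follows is a minimal sketch of one simple approach, not the WordPress implementation: match with the pattern from this post, then move any trailing punctuation back outside the link. link_urls is a hypothetical function name, and the set of characters treated as sentence punctuation is my own assumption.

```php
<?php
// Turn plain-text URLs into links, leaving sentence punctuation outside.
function link_urls(string $text): string
{
    $pattern = '|([A-Za-z]{3,9})://([-;:&=\+\$,\w]+@{1})?([-A-Za-z0-9\.]+)+:?(\d+)?((/[-\+~%/\.\w]+)?\??([-\+=&;%@\.\w]+)?#?([\w]+)?)?|';

    return preg_replace_callback($pattern, function (array $m): string {
        // Assumption: these characters end a sentence, not a URL.
        $url      = rtrim($m[0], '.,?!;:');
        $trailing = substr($m[0], strlen($url));
        $safe     = htmlspecialchars($url, ENT_QUOTES);

        return '<a href="' . $safe . '">' . $safe . '</a>' . $trailing;
    }, $text);
}

echo link_urls('Have you visited http://bobsguides.com? I like http://bobsguides.com.'), "\n";
```

The trade-off is that a URL which genuinely ends in one of those characters loses it, so this suits running prose better than lists of bare URLs.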
SIFE December 8, 2012 at 12:14 am

I think this regular expression, posted by Jeff Atwood, is the most correct one for matching known URLs:
\(?\bhttp://[-A-Za-z0-9+&@#/%?=~_()|!:,.;]*[-A-Za-z0-9+&@#/%=~_()|]
Tested against these URLs:
http://codecanyon.net/
http://woork.blogspot.com/2008/06/clean-and-pure-css-form-design.html
http://www.onextrapixel.com/2009/08/25/how-to-use-pure-css-to-style-web-form-dynamically-plus-12-awesome-javascript-plugins/
Lucas Pelegrino April 17, 2014 at 8:26 pm

Thank you very much!

That regex was the only one that worked for my edge case 😉

Regards from Brazil.
Beladel ilyes Abdelrazak May 24, 2014 at 1:53 am

I have looked around the web for URL pattern matching, but unfortunately all of the patterns failed against test URLs of these forms:
http://www.example.com/etcetc.
https://www.example.com/etcetc.
http://example.com/etcetc.
www.example.com/etcetc
example.com/etcetc
user:pass@example.com/etcetc
http://i88.photobucket.com/albums/k163/thien0211photo/hinh chuyen de up/KienG20cau20Xeobuom20copy.jpg
http://www.google.com&lt
http://google.de/?aaa
http://google.de//?aaa
http://www.flickr.com/photos/27703312@N04/
http://core.trac.wordpress.org/browser/trunk/wp-includes/formatting.php#L1413
http://bobsguides.com?
http://bobsguides.com.
http://woork.blogspot.com/2008/06/clean-and-pure-css-form-design.html
http://www.onextrapixel.com/2009/08/25/how-to-use-pure-css-to-style-web-form-dynamically-plus-12-awesome-javascript-plugins/

So I wrote my own and made sure that it passes the test. Please comment if you know of a URL that can’t be matched by this pattern:
$url_pattern = '/((http|https)\:\/\/)?[a-zA-Z0-9\.\/\?\:@\-_=#& ]+\.([a-zA-Z0-9\.\/\?\:@\-_=#& ])*/';
Johnathan December 1, 2014 at 7:05 am

I think you should escape the slashes there. I created a regex at a live regex tester using the regex in the last comment, and it’s working.

https://www.liveregex.com/opZ2k