A bold claim, but I think I’ve got one:
|([A-Za-z]{3,9})://([-;:&=\+\$,\w]+@{1})?([-A-Za-z0-9\.]+)+:?(\d+)?((/[-\+~%/\.\w]+)?\??([-\+=&;%@\.\w]+)?#?([\w]+)?)?|
An online events booking system I developed doesn’t allow HTML in the event description field, primarily to protect against annoying scripting attacks. But what if you want to provide a link in the description? I need to detect plain text URLs stored in the database, and turn them into hyperlinks when displayed in the browser. The regular expression above allows me to do that quite easily in PHP:
$pattern = '|([A-Za-z]{3,9})://([-;:&=\+\$,\w]+@{1})?([-A-Za-z0-9\.]+)+:?(\d+)?((/[-\+~%/\.\w]+)?\??([-\+=&;%@\.\w]+)?#?([\w]+)?)?|';
$html = preg_replace($pattern, '<a href="$0">$0</a>', $text);
But the regular expression has several submatches. They provide a means to break down the URL into its constituent parts, including protocol, user info, server name, REQUEST_URI, query string and anchor.
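To make the submatches concrete, here is a minimal sketch that runs the pattern against a single URL with preg_match() and labels each captured group. The sample URL and the array keys are my own assumptions for illustration, not part of the class mentioned in this post. Note that the user info group keeps its trailing @.

```php
<?php
// Pattern from the post. '|' is the delimiter; single quotes keep the backslashes intact.
$pattern = '|([A-Za-z]{3,9})://([-;:&=\+\$,\w]+@{1})?([-A-Za-z0-9\.]+)+:?(\d+)?((/[-\+~%/\.\w]+)?\??([-\+=&;%@\.\w]+)?#?([\w]+)?)?|';

// Hypothetical sample URL, chosen so that every submatch is populated.
$url = 'http://user@www.example.com:8080/cgi-bin/script.cgi?variable=value#section1';

$parts = [];
if (preg_match($pattern, $url, $m)) {
    // Illustrative key names; the numeric indexes are the pattern's capture groups.
    $parts = [
        'protocol'    => $m[1],
        'userinfo'    => $m[2], // includes the trailing '@'
        'host'        => $m[3],
        'port'        => $m[4],
        'request_uri' => $m[5], // path, query string and anchor together
        'path'        => $m[6],
        'query'       => $m[7],
        'anchor'      => $m[8],
    ];
}

print_r($parts);
```

When a URL lacks one of the trailing parts (no anchor, say), preg_match() leaves the corresponding index out of $m, so real code would want isset() checks before reading the later groups.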
Here’s a PHP class I wrote that uses this Regular Expression to analyse a string, detect URLs, populate an array with the constituent parts of the URL, and replace URLs with hyperlinks. Here’s an example of usage:
$text  = 'Please visit http://www.example.com/cgi-bin/';
$text .= 'script.cgi?variable=value&variable2=some';
$text .= '+url+encoded+text#section-1 to find out more';
$urlf = new URLFinder();
$html = $urlf->make_links($text);
echo $html;
If you print_r($urlf), you can see how the URL is broken down.
I haven’t managed to find any exceptions to the expression, but if you do, please post an example.


Your regular expression cannot match this URL:
“http://i88.photobucket.com/albums/k163/thien0211photo/hinh chuyen de up/KienG20cau20Xeobuom20copy.jpg”
Thanks. But strictly speaking, the spaces in that URL should be encoded as %20.
http://www.faqs.org/rfcs/rfc1738.html
The application that uses this regular expression has to ignore whitespace, as it’s intended to detect URLs within longer strings that contain other text and convert them to links.
It looks like Photobucket correctly encodes URLs in their HTML, even if the %20 characters don’t show up in the address bar.
The regular expression does not match these, for example:
1. A link inside htmlspecialchars output: >http://www.google.com<
2. This link: http://localhost:32000/?ch=1
Great! This saved me a nerve-racking evening (I’m a regular expression loser)
I’ve updated it to be a bit more restrictive: domains must now end in a TLD such as .xx, and consecutive dots in the domain name are no longer allowed. I’ve also grouped the : with the port, and the ? or & (some sites don’t use ? for some reason) with the parameter data. It also accounts for the problems Pavel mentioned above. Here’s my regex:
(([A-Za-z]{3,9})://)?([-;:&=\+\$,\w]+@{1})?(([-A-Za-z0-9]+\.)+[A-Za-z]{2,3})(:\d+)?((/[-\+~%/\.\w]+)?/?([&?][-\+=&;%@\.\w]+)?(#[\w]+)?)?
Thanks, Howie. I’d suggest the limiting repetition modifier for top-level domains should be {2,6} to allow for things like .museum. See the list of currently implemented TLDs at the IANA:
http://data.iana.org/TLD/tlds-alpha-by-domain.txt
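A quick way to see the difference is to run Howie’s pattern with both quantifiers against a long TLD. This is only a sketch: the URL is a made-up example, and the | delimiter is my assumption, since his comment doesn’t show one.

```php
<?php
// Howie's pattern from the comment above, wrapped in '|' delimiters (assumed).
$howie = '|(([A-Za-z]{3,9})://)?([-;:&=\+\$,\w]+@{1})?(([-A-Za-z0-9]+\.)+[A-Za-z]{2,3})(:\d+)?((/[-\+~%/\.\w]+)?/?([&?][-\+=&;%@\.\w]+)?(#[\w]+)?)?|';

// Same pattern with the TLD quantifier widened from {2,3} to {2,6}.
$wider = str_replace('{2,3}', '{2,6}', $howie);

// Hypothetical URL with a six-letter TLD.
$url = 'http://example.museum';

preg_match($howie, $url, $a);
preg_match($wider, $url, $b);

echo $a[0], "\n"; // http://example.mus
echo $b[0], "\n"; // http://example.museum
```

With {2,3} the match is cut off after three letters of the TLD, so the generated link would point at the wrong host.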
Problems with your code and links like http://google.de/?aaa
http://google.de//?aaa
works, hmm…
This is one of the problems Pavel found. In your example URL, the default script name is implicit, so can be omitted before the query string.
The problematic portion is
(/[-\+~%/\.\w]+)?
Adding a second / results in a match, as you’ve found. So this portion of the regex should be:
(/[-\+~%/\.\w]*)?
This will match the trailing slash after the FQDN, and optionally match any valid filename characters that follow.
The complete subpattern is optional, so things like http://www.google.de will still be matched despite the lack of a trailing slash.
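The effect of relaxing the path portion can be checked with a short script. This is a sketch under one assumption: the character class after the slash is made optional with *, so that a bare / is accepted while longer paths still match. The helper first_match() is a hypothetical name used only here.

```php
<?php
// Original pattern from the post.
$original = '|([A-Za-z]{3,9})://([-;:&=\+\$,\w]+@{1})?([-A-Za-z0-9\.]+)+:?(\d+)?((/[-\+~%/\.\w]+)?\??([-\+=&;%@\.\w]+)?#?([\w]+)?)?|';

// Assumed fix: let the path be a bare "/" by changing '+' to '*' in the path portion.
$relaxed = str_replace('(/[-\+~%/\.\w]+)?', '(/[-\+~%/\.\w]*)?', $original);

// Hypothetical helper: returns the first full match, or '' if there is none.
function first_match(string $pattern, string $text): string
{
    return preg_match($pattern, $text, $m) ? $m[0] : '';
}

echo first_match($original, 'http://google.de/?aaa'), "\n";        // http://google.de
echo first_match($original, 'http://google.de//?aaa'), "\n";       // http://google.de//?aaa
echo first_match($relaxed, 'http://google.de/?aaa'), "\n";         // http://google.de/?aaa
echo first_match($relaxed, 'http://localhost:32000/?ch=1'), "\n";  // http://localhost:32000/?ch=1
echo first_match($relaxed, 'http://www.google.de'), "\n";          // http://www.google.de
```

The original pattern drops the /?aaa part because the path group demands at least one character after the slash; the relaxed version keeps the query string and also handles Pavel’s localhost example.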
I’ve tried all of these patterns to handle URLs from Flickr. For example,
http://www.flickr.com/photos/27703312@N04/
Doing a simple replacement of the URL with the text “url” results in:
url@N04 or url@N04/
It looks like these patterns don’t take the “@” into account.
David
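One possible workaround, offered as a sketch rather than a tested fix, is to allow @ inside the path character class. The example below uses Howie’s pattern (again with an assumed | delimiter) to reproduce the behaviour described, then applies the tweak.

```php
<?php
// Howie's pattern from the earlier comment, wrapped in '|' delimiters (assumed).
$howie = '|(([A-Za-z]{3,9})://)?([-;:&=\+\$,\w]+@{1})?(([-A-Za-z0-9]+\.)+[A-Za-z]{2,3})(:\d+)?((/[-\+~%/\.\w]+)?/?([&?][-\+=&;%@\.\w]+)?(#[\w]+)?)?|';

// Assumed tweak: also accept '@' as a path character.
$withAt = str_replace('(/[-\+~%/\.\w]+)?', '(/[-\+~%/\.\w@]+)?', $howie);

$text = 'http://www.flickr.com/photos/27703312@N04/';

$before = preg_replace($howie, 'url', $text);
$after  = preg_replace($withAt, 'url', $text);

echo $before, "\n"; // url@N04/
echo $after, "\n";  // url
```

The user info group sits before the host name, so accepting @ in the path should not interfere with URLs of the form user@host.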
Great, Thanks
Is this one worth using?
“(http|www)\S+(\s+)?\S+”
It doesn’t compile on http://www.regexplanet.com/simple/index.html. It also requires the http part, whereas many people omit that when they enter a URL.
This is a thing that always surprises me: why should regular expressions for URL and email matching be so long? I know that there are a lot of rules… but I prefer the shortest version (like Nibalkar’s) and risk losing the others, instead of having to encode all the rules and the exceptions to the rules.
I’ve been trawling the web for a solution and could not find one, so I’ve written my own, which matches any URL in all the usual formats, such as http://www.google.com, as well as emails such as matt@test.com.
See my blog post about the regular expression including a test case at http://blog.mattheworiordan.com/post/13174566389/url-regular-expression-for-links-with-or-without-the
… and of course WordPress itself contains code that identifies URIs:
http://core.trac.wordpress.org/browser/trunk/wp-includes/formatting.php#L1413