Update: see my most recent comment. WordPress has a pretty good regex for matching URLs.
A bold claim, but I think I’ve got one:
|([A-Za-z]{3,9})://([-;:&=\+\$,\w]+@{1})?([-A-Za-z0-9\.]+)+:?(\d+)?((/[-\+~%/\.\w]+)?\??([-\+=&;%@\.\w]+)?#?([\w]+)?)?|
An online events booking system I developed doesn’t allow HTML in the event description field, primarily to protect against annoying scripting attacks. But what if you want to provide a link in the description? I need to detect plain text URLs stored in the database, and turn them into hyperlinks when displayed in the browser. The regular expression above allows me to do that quite easily in PHP:
$pattern = '|([A-Za-z]{3,9})://([-;:&=\+\$,\w]+@{1})?([-A-Za-z0-9\.]+)+:?(\d+)?((/[-\+~%/\.\w]+)?\??([-\+=&;%@\.\w]+)?#?([\w]+)?)?|';
$html = preg_replace($pattern, '<a href="$0">$0</a>', $text);
The regular expression also has several submatches. They provide a means to break down the URL into its constituent parts, including protocol, user info, server name, REQUEST_URI, query string and anchor.
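To make the submatches concrete, here is a minimal sketch using preg_match() with the pattern above. The sample URL and the array key names are illustrative assumptions of mine, chosen so that every subpattern has something to capture; they are not taken from any existing class.

```php
<?php
// Pattern from the post; single quotes keep the backslashes intact.
$pattern = '|([A-Za-z]{3,9})://([-;:&=\+\$,\w]+@{1})?([-A-Za-z0-9\.]+)+:?(\d+)?((/[-\+~%/\.\w]+)?\??([-\+=&;%@\.\w]+)?#?([\w]+)?)?|';

// Hypothetical URL, chosen so that every subpattern takes part in the match.
$url = 'https://user:pw@www.example.com:8080/docs/index.html?x=1&y=2#top';

$parts = [];
if (preg_match($pattern, $url, $m)) {
    // Trailing groups that did not take part in a match are absent from $m,
    // hence the ?? '' fallbacks.
    $parts = [
        'protocol'    => $m[1] ?? '',
        'user_info'   => $m[2] ?? '', // includes the trailing "@"
        'server'      => $m[3] ?? '',
        'port'        => $m[4] ?? '',
        'request_uri' => $m[6] ?? '',
        'query'       => $m[7] ?? '',
        'anchor'      => $m[8] ?? '',
    ];
}
print_r($parts);
```

Group 5 is skipped here because it simply wraps groups 6 to 8.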
Here’s a PHP class I wrote that uses this Regular Expression to analyse a string, detect URLs, populate an array with the constituent parts of the URL, and replace URLs with hyperlinks. Here’s an example of usage:
$text = 'Please visit http://www.example.com/cgi-bin/';
$text .= 'script.cgi?variable=value&variable2=some';
$text .= '+url+encoded+text#section-1 to find out more';
$urlf = new URLFinder();
$html = $urlf->make_links($text);
echo $html;
If you print_r($urlf), you can see how the URL is broken down.
I haven’t managed to find any exceptions to the expression, but if you do, please post an example.
Your regular expression cannot match this URL:
“http://i88.photobucket.com/albums/k163/thien0211photo/hinh chuyen de up/KienG20cau20Xeobuom20copy.jpg”
Thanks. But strictly speaking, the spaces in that URL should be encoded as %20.
http://www.faqs.org/rfcs/rfc1738.html
The application that uses this regular expression has to ignore whitespace, as it’s intended to detect URLs within longer strings that contain other text and convert them to links.
It looks like Photobucket correctly encodes URLs in their HTML, even if the %20 characters don’t show up in the address bar.
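As a rough illustration of the point about encoding, the sketch below runs the pattern from the post against the Photobucket URL twice: once as typed, and once with the spaces replaced by %20. The variable names are my own, and replacing only the spaces with str_replace() is an assumption made for brevity, not a general URL-encoding routine.

```php
<?php
// Pattern from the post.
$pattern = '|([A-Za-z]{3,9})://([-;:&=\+\$,\w]+@{1})?([-A-Za-z0-9\.]+)+:?(\d+)?((/[-\+~%/\.\w]+)?\??([-\+=&;%@\.\w]+)?#?([\w]+)?)?|';

$raw = 'http://i88.photobucket.com/albums/k163/thien0211photo/hinh chuyen de up/KienG20cau20Xeobuom20copy.jpg';

// Encode only the spaces; rawurlencode() on the whole URL would also
// encode the ":" and "/" characters that must stay literal.
$encoded = str_replace(' ', '%20', $raw);

preg_match($pattern, $raw, $rawMatch);
preg_match($pattern, $encoded, $encodedMatch);

echo $rawMatch[0], "\n";     // the match stops at the first space
echo $encodedMatch[0], "\n"; // the match covers the whole encoded URL
```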
The Regular Expression does not match these, for example:
1. link inside of htmlspecialchars : >http://www.google.com<
2. this link : http://localhost:32000/?ch=1
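On the first case, assuming it refers to text that has already been passed through htmlspecialchars(): the entity that follows the URL (such as &lt;) is made of characters the query-string subpattern accepts, so it gets pulled into the match. One possible way around this, sketched below, is to find the URLs in the raw text first and escape each piece afterwards. The function name linkify_escaped() is hypothetical.

```php
<?php
// Pattern from the post.
$pattern = '|([A-Za-z]{3,9})://([-;:&=\+\$,\w]+@{1})?([-A-Za-z0-9\.]+)+:?(\d+)?((/[-\+~%/\.\w]+)?\??([-\+=&;%@\.\w]+)?#?([\w]+)?)?|';

// Hypothetical helper: locate URLs in the unescaped text, then escape the
// URLs and the text between them separately.
function linkify_escaped(string $text, string $pattern): string
{
    $out = '';
    $pos = 0;
    preg_match_all($pattern, $text, $matches, PREG_OFFSET_CAPTURE);
    foreach ($matches[0] as [$url, $offset]) {
        // Plain text between the previous URL and this one.
        $out .= htmlspecialchars(substr($text, $pos, $offset - $pos));
        $safe = htmlspecialchars($url);
        $out .= '<a href="' . $safe . '">' . $safe . '</a>';
        $pos = $offset + strlen($url);
    }
    // Whatever follows the last URL.
    return $out . htmlspecialchars(substr($text, $pos));
}

echo linkify_escaped('See <b>http://www.google.com</b> now', $pattern), "\n";
```

Because the match runs before any entities exist, the "<" ends the URL as expected and is only escaped afterwards.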
Great! This saved me a nerve-racking evening (I’m a regular expression loser) 🙂
I’ve updated it to be a bit more restrictive: domains must now end in .xx, and consecutive .’s in the domain name are no longer allowed. I’ve grouped the : with the port, and the ? or & (some sites don’t use ? for some reason) with the parameter data. I’ve also accounted for the problem Pavel mentioned above. Here’s my regex:
(([A-Za-z]{3,9})://)?([-;:&=\+\$,\w]+@{1})?(([-A-Za-z0-9]+\.)+[A-Za-z]{2,3})(:\d+)?((/[-\+~%/\.\w]+)?/?([&?][-\+=&;%@\.\w]+)?(#[\w]+)?)?
Thanks, Howie. I’d suggest the limiting repetition modifier for top-level domains should be {2,6} to allow for things like .museum. See the list of currently-implemented TLDs at the IANA:
http://data.iana.org/TLD/tlds-alpha-by-domain.txt
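A quick sketch of the difference the {2,6} quantifier makes, using Howie’s expression as posted. The .museum URL is a made-up example, and the | delimiters are my choice (the expression contains no | of its own).

```php
<?php
// Howie's expression as posted, with {2,3} for the top-level domain.
$short = '|(([A-Za-z]{3,9})://)?([-;:&=\+\$,\w]+@{1})?(([-A-Za-z0-9]+\.)+[A-Za-z]{2,3})(:\d+)?((/[-\+~%/\.\w]+)?/?([&?][-\+=&;%@\.\w]+)?(#[\w]+)?)?|';

// Same expression with the suggested {2,6}.
$long = str_replace('{2,3}', '{2,6}', $short);

// Hypothetical URL on a six-letter TLD.
$url = 'http://www.example.museum/visit';

preg_match($short, $url, $shortMatch);
preg_match($long, $url, $longMatch);

echo $shortMatch[0], "\n"; // cut off after the first three letters of the TLD
echo $longMatch[0], "\n";  // the whole URL
```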
Problems with your code and links like http://google.de/?aaa
http://google.de//?aaa
works, hmm…
This is one of the problems Pavel found. In your example URL, the default script name is implicit, so can be omitted before the query string.
The problematic portion is
(/[-\+~%/\.\w]+)?
Adding a second / results in a match, as you’ve found. So this portion of the regex should be:
(/[-\+~%/\.\w]*)?
This will match the trailing slash after the FQDN, and optionally match any valid filename characters that follow.
The complete subpattern is optional, so things like http://www.google.de will still be matched despite the lack of a trailing slash.
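The effect can be checked with a small sketch. It compares the original path subpattern, which needs at least one character after the slash, with a variant that accepts zero or more (a * quantifier, which I am assuming is the intent of the fix). The helper first_match() is hypothetical.

```php
<?php
// Original pattern: the path needs at least one character after the slash.
$plus = '|([A-Za-z]{3,9})://([-;:&=\+\$,\w]+@{1})?([-A-Za-z0-9\.]+)+:?(\d+)?((/[-\+~%/\.\w]+)?\??([-\+=&;%@\.\w]+)?#?([\w]+)?)?|';

// Variant: zero or more characters after the slash.
$star = str_replace('(/[-\+~%/\.\w]+)?', '(/[-\+~%/\.\w]*)?', $plus);

// Hypothetical helper: return the first full match, or '' if there is none.
function first_match(string $pattern, string $text): string
{
    return preg_match($pattern, $text, $m) ? $m[0] : '';
}

foreach (['http://google.de/?aaa', 'http://localhost:32000/?ch=1', 'http://www.google.de'] as $url) {
    echo first_match($plus, $url), ' | ', first_match($star, $url), "\n";
}
```

With the original subpattern, the first two URLs are cut off before the slash; with the variant, all three are matched in full.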
I’ve tried all of these patterns to handle URLs from Flickr. For example,
http://www.flickr.com/photos/27703312@N04/
doing a simple replacement of the URL with the text “url” results in:
url@N04 or url@N04/
Looks like these strings don’t take the “@” into account.
David
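For what it’s worth, the path subpattern in the original expression does not list @ among its allowed characters, which would explain why the match ends early on URLs like this. A possible tweak, sketched here as an untested-in-production assumption rather than a recommended fix, is to add @ to that character class.

```php
<?php
// Original pattern from the post.
$original = '|([A-Za-z]{3,9})://([-;:&=\+\$,\w]+@{1})?([-A-Za-z0-9\.]+)+:?(\d+)?((/[-\+~%/\.\w]+)?\??([-\+=&;%@\.\w]+)?#?([\w]+)?)?|';

// Assumed tweak: also allow "@" inside the path.
$withAt = str_replace('(/[-\+~%/\.\w]+)?', '(/[-\+~%/\.\w@]+)?', $original);

$text = 'See http://www.flickr.com/photos/27703312@N04/ here';

echo preg_replace($original, 'url', $text), "\n"; // leaves the trailing slash behind
echo preg_replace($withAt, 'url', $text), "\n";   // replaces the whole URL
```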
Great, Thanks 🙂
Is this one worth considering?
“(http|www)\S+(\s+)?\S+”
It doesn’t compile on http://www.regexplanet.com/simple/index.html. Also, it requires the http part, whereas many people omit that when they input a URL.
This is something that always surprises me: why should the regular expressions for URL and email matching be so long? I know that there are a lot of rules… but I prefer the shortest version (like Nibalkar’s) and risk losing the others, instead of having to encode all the rules and the exceptions to the rules.
I’ve been trawling the web for a solution and could not find one, so I’ve written my own, which matches any URL in all the usual formats, such as www.google.com and http://www.google.com, as well as emails such as matt@test.com.
See my blog post about the regular expression including a test case at http://blog.mattheworiordan.com/post/13174566389/url-regular-expression-for-links-with-or-without-the
… and of course WordPress itself contains code that identifies URIs:
http://core.trac.wordpress.org/browser/trunk/wp-includes/formatting.php#L1413
It’s a nice regex, but it fails (like every one I’ve found) with URLs in text (which is what I need) when they end with punctuation, for example: Have you visited http://bobsguides.com? or I like http://bobsguides.com.
The terminal punctuation is considered part of the URL.
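One possible workaround, sketched here with the pattern from the post, is to let the expression over-match and then peel any trailing punctuation off each match inside a preg_replace_callback() handler. The function name and the set of punctuation characters are assumptions of mine, and HTML escaping is left out to keep the sketch short.

```php
<?php
// Pattern from the post.
$pattern = '|([A-Za-z]{3,9})://([-;:&=\+\$,\w]+@{1})?([-A-Za-z0-9\.]+)+:?(\d+)?((/[-\+~%/\.\w]+)?\??([-\+=&;%@\.\w]+)?#?([\w]+)?)?|';

// Hypothetical helper: link each URL, but keep trailing punctuation outside the link.
function link_without_trailing_punctuation(string $text, string $pattern): string
{
    return preg_replace_callback($pattern, function (array $m): string {
        // Assumed set of characters that end a sentence rather than a URL.
        $url  = rtrim($m[0], '.,;:!?');
        $tail = substr($m[0], strlen($url));
        return '<a href="' . $url . '">' . $url . '</a>' . $tail;
    }, $text);
}

$text = 'Have you visited http://bobsguides.com? or I like http://bobsguides.com.';
echo link_without_trailing_punctuation($text, $pattern), "\n";
```

The trade-off is that a URL which genuinely ends in one of those characters loses it from the link.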
But WordPress correctly handles the terminal punctuation, as you can see. A question mark and a dot are both valid parts of a URL, except where they appear on their own.
You can find that regex here:
http://core.trac.wordpress.org/browser/trunk/wp-includes/formatting.php#L1569
Someday I’ll update this post to reflect all the suggestions people have made. We get a lot of traffic from Google on this post, and it’d be good to give them something they can use. In the meantime, try the WordPress regex.
I think this regular expression posted by Mr Jeff Atwood is the most correct one that can match known URLs:
\(?\bhttp://[-A-Za-z0-9+&@#/%?=~_()|!:,.;]*[-A-Za-z0-9+&@#/%=~_()|]
Tested against these URLs:
http://codecanyon.net/
http://woork.blogspot.com/2008/06/clean-and-pure-css-form-design.html
http://www.onextrapixel.com/2009/08/25/how-to-use-pure-css-to-style-web-form-dynamically-plus-12-awesome-javascript-plugins/
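A minimal sketch of that expression in PHP, for anyone who wants to try it. The {} delimiters are my choice, because the expression itself contains /, #, ~ and |; the sample sentence is made up. Note that, as written, the expression only covers http:// and not https://.

```php
<?php
// The expression quoted above, wrapped in {} delimiters.
$atwood = '{\(?\bhttp://[-A-Za-z0-9+&@#/%?=~_()|!:,.;]*[-A-Za-z0-9+&@#/%=~_()|]}';

// Hypothetical sample text with punctuation directly after each URL.
$text = 'Have you visited http://codecanyon.net/? I like http://bobsguides.com.';

preg_match_all($atwood, $text, $m);
print_r($m[0]);
```

The final character class leaves out ".", "?" and similar characters, so sentence punctuation after a URL stays outside the match.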
Thank you very much!
That regex was the only one that worked with my very edge use case 😉
Regards from Brazil.
I have looked around the web for URL pattern matching, but unfortunately all of them failed against test URLs that may take these forms:
http://www.example.com/etcetc.
https://www.example.com/etcetc.
http://example.com/etcetc.
www.example.com/etcetc
example.com/etcetc
user:pass@example.com/etcetc
http://i88.photobucket.com/albums/k163/thien0211photo/hinh chuyen de up/KienG20cau20Xeobuom20copy.jpg
http://www.google.com<
http://google.de/?aaa
http://google.de//?aaa
http://www.flickr.com/photos/27703312@N04/
http://core.trac.wordpress.org/browser/trunk/wp-includes/formatting.php#L1413
http://bobsguides.com?
http://bobsguides.com.
http://woork.blogspot.com/2008/06/clean-and-pure-css-form-design.html
http://www.onextrapixel.com/2009/08/25/how-to-use-pure-css-to-style-web-form-dynamically-plus-12-awesome-javascript-plugins/
So I wrote my own and made sure that it passes the test. Please comment if you know of a URL that can’t be matched by this pattern:
$url_pattern = '/((http|https)\:\/\/)?[a-zA-Z0-9\.\/\?\:@\-_=#& ]+\.([a-zA-Z0-9\.\/\?\:@\-_=#& ])*/';
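One thing worth checking before using this pattern on running text: the character classes contain a space, which is what lets it accept the Photobucket URL, but it also means the surrounding words can be swallowed. The sketch below shows this on a made-up sentence, alongside a variant with the space removed; that variant is only an illustration of the trade-off, since it no longer matches URLs containing spaces.

```php
<?php
// The pattern from the comment above.
$url_pattern = '/((http|https)\:\/\/)?[a-zA-Z0-9\.\/\?\:@\-_=#& ]+\.([a-zA-Z0-9\.\/\?\:@\-_=#& ])*/';

// Illustrative variant with the space removed from both character classes.
$no_space = str_replace('& ]', '&]', $url_pattern);

// Hypothetical sentence containing a bare domain.
$text = 'Please visit example.com today';

preg_match($url_pattern, $text, $with);
preg_match($no_space, $text, $without);

echo $with[0], "\n";    // the whole sentence is matched
echo $without[0], "\n"; // only the domain is matched
```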
I think you should escape the slashes there. I created a regex in a live regex tester using the regex in the last comment, and it’s working.
https://www.liveregex.com/opZ2k