Two weeks ago I was at the first Dutch SqlZaterdag which is the Dutch version of SQLBits or SqlSaturday. It was a great day and almost 200 people showed up. I’m sure there will be another SqlZaterdag.
This post is a reply to the session from @andrekamman and Henk van der Valk who has put the presentation online. In their session there was a question: What is the fastest way to count the words in a string? The correct answer according to Andre and Henk was:
select LEN(@s) - LEN(replace(@s, ' ', ''))+1;
Which indeed does the trick. If @s is “To be or not to be, that is the question.” it will return 10. But what if someone didn’t put a space after the colon or put two spaces. A typo is easily made. To test it I used this script:
declare @start datetime2 = SYSDATETIME();
declare @end datetime2;
declare @s varchar(100);
declare @s1 varchar(100);
declare @s2 varchar(100);
set @s = 'To be or not to be, that is the question.';
set @s1 = 'To be or not te be,that is the question.';
set @s2 = 'To be or not to be, that is the question.';
select LEN(@s) - LEN(replace(@s, ' ', ''))+1;
select LEN(@s1) - LEN(replace(@s1, ' ', ''))+1;
select LEN(@s2) - LEN(replace(@s2, ' ', ''))+1;
set @end = SYSDATETIME();
select @start;
select @end;
select DATEDIFF(MICROSECOND, @start, @end);
The first time it takes a wile to run but the second time it will run much faster. The problem is that this will return three different outcomes, 10,9 and 11. So, yes it maybe the fastest way to count words in a string but not the most precise. But then again, if that’s not what you’re after, this is it.
If you would ask a developer for a solution he/she will say Regular Expressions. So, how do we get these in T-SQL. Yes, by using a CLR function.
Fire up Visual Studio 2008 and create a new Database – SQL Server Project. Give it a name, in my case I named it CountWords. Select your project and choose Add – User-Defined Function
Name the function RegExCount and replace the contents with this code.
using System;
using System.Text;
using System.Text.RegularExpressions;
using System.Data;
using System.Data.SqlClient;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;
public partial class UserDefinedFunctions
{
[Microsoft.SqlServer.Server.SqlFunction]
public static SqlInt32 RegExCount(string pattern, string message)
{
Regex r = new Regex(pattern, RegexOptions.IgnoreCase);
int i = r.Matches(message).Count;
return new SqlInt32(i);
}
};
This CLR function will accept a pattern and a search string. It will then count the occurrences of the pattern in the string and return it. Great thing about Visual Studio is that you can deploy this function directly on your SQL Server. So click Build and then Deploy to get the DLL inside your database. Or do it the proper way, build it and use CREATE ASSEMBLY to add it to a database.
Now run the following script:
declare @start datetime2 = SYSDATETIME();
declare @end datetime2;
declare @s varchar(100);
declare @s1 varchar(100);
declare @s2 varchar(100);
set @s = 'To be or not to be, that is the question.';
set @s1 = 'To be or not te be,that is the question.';
set @s2 = 'To be or not to be, that is the question.';
select dbo.RegExCount('(\s+)|(\w,\w)|(\w\.\w)', @s)+1;
select dbo.RegExCount('(\s+)|(\w,\w)|(\w\.\w)', @s1)+1;
select dbo.RegExCount('(\s+)|(\w,\w)|(\w\.\w)', @s2)+1;
set @end = SYSDATETIME();
select @start;
select @end;
select DATEDIFF(MICROSECOND, @start, @end);
We feed the CLR-function with a RegEx Pattern:
- \s+: Matches one or more whitespaces.
- \w,\w: Matches a character followed by a colon followed by a character.
- \w\.\w: Matches a character followed by a period followed by a character.
The result will be 10 for all three strings and although the first time it will take some time (the DLL needs to be compiled), subsequent runs are faster. So is this the fastest way? Maybe not but it is the more accurate one. And the pattern can be extended to catch more typo’s.
But still, SqlZaterdag was great!!!