sql server - Fuzzy grouping in SQL -


i need modify sql table group mismatched names, , assign elements in group standardized name.

for instance, if initial table looks this:

name -------- jon q john q jonn q mary w marie w matt h 

i create new table or add field existing 1 this:

name     | stdname -------------------- jon q    | jon q john q   | jon q jonn q   | jon q mary w   | mary w marie w  | mary w matt h   | matt h 

in case, i've chosen first name assign "standardized name," don't care 1 chosen -- final "standardized name" hashed unique person id. (i'm open alternative solutions go directly numerical id.) have birthdates match on well, accuracy of name matching doesn't need precise in practice. i've looked bit , use jaro-winkler algorithm (see e.g. here).

if knew names in pairs, relatively easy query, there can arbitrary number of same name.

i can conceptualize how query in procedural language, i'm not familiar sql. unfortunately don't have direct access data -- it's sensitive data , else (a bureaucrat) has run actual query me. specific implementation sql server, i'd prefer implementation-agnostic solution.

edit:

in response comment, had following procedural approach in mind. it's in python, , replaced jaro-winkler matching on first letter of name, sake of having working code example.

namelist = ['jon q', 'john q', 'jonn q', 'mary w', 'marie w', 'larry h'] stdlist = namelist[:]  # loop on names i1, name1 in enumerate(stdlist):    # loop on later names in list find matches   i2, name2 in enumerate(stdlist[i1+1:]):      # if there's match, replace latter former.     if (name1[0] == name2[0]):       stdlist[i1+1+i2] = name1  print stdlist 

the result ['jon q', 'jon q', 'jon q', 'mary w', 'mary w', 'larry h'].

assuming copy , paste jaro-winkler implementation ssc (registration required), following code work. tried build sqlfiddle kept going belly when building schema.

this implementation has cheat---i'm using cursor. generally, cursors not conducive performance in case, need able compare set against itself. there's graceful number/tally table approach eliminate declared cursor.

declare @src table (     source_string varchar(50) not null ,   ref_id int identity(1,1) not null );  -- identify matches declare @work table (     source_ref_id int not null ,   match_ref_id int not null );  insert     @src select 'jon q' union select 'john q' union select 'john q' union select 'jonn q' -- oops on matching joan jon union select 'joan q' union select 'june' union select 'mary w' union select 'marie w' union select 'matt h';  -- 2 problems address -- duplicates in our inbound set -- duplicates against reference set -- -- better matching occur if names split ordinal entities -- splitting on whitespace questionable -- -- mat, matt, matthew   declare csr cursor read_only  select distinct     s1.source_string ,   s1.ref_id     @src s1 order     s1.ref_id;  declare @source_string varchar(50), @ref_id int open csr  fetch next csr @source_string, @ref_id while (@@fetch_status <> -1) begin     if (@@fetch_status <> -2)     begin         if not exists         (             select * @work w w.match_ref_id = @ref_id         )         begin             insert                 @work             select                 @ref_id             ,   s.ref_id                             @src s                 -- if have matched value, skip                 left outer join                     @work w                     on w.match_ref_id = s.ref_id                             -- don't match                 s.ref_id <> @ref_id                 -- arbitrary threshold, need examine sanity                 , dbo.fn_calculatejarowinkler(@source_string, s.source_string) > .95         end     end     fetch next csr @source_string, @ref_id end  close csr  deallocate csr  -- show me list of unmatched rows  -- plus retained  ;with matches (     select          s1.source_string     ,   s1.ref_id     ,   s2.source_string match_source_string     ,   s2.ref_id match_ref_id              @src s1         inner join             @work w             on w.source_ref_id = s1.ref_id         inner join             @src s2             on s2.ref_id = w.match_ref_id ) , unmatches (     select          s1.source_string     ,   s1.ref_id     ,   null match_source_string     ,   null match_ref_id              @src s1         left outer join             @work w             on w.source_ref_id = s1.ref_id         left outer join             @work s2             on s2.match_ref_id = s1.ref_id             w.source_ref_id null         , s2.match_ref_id null ) select     m.source_string ,   m.ref_id ,   m.match_source_string ,   m.match_ref_id     matches m union select     m.source_string ,   m.ref_id ,   m.match_source_string ,   m.match_ref_id     unmatches m;  -- solve request  select     s.source_string name ,   coalesce(s2.source_string, s.source_string) stdname     @src s     left outer join         @work w         on w.match_ref_id = s.ref_id     left outer join         @src s2         on s2.ref_id = w.source_ref_id 

query output 1

source_string   ref_id  match_source_string match_ref_id jon q   1   john q  2 jon q   1   john q  3 jon q   1   jonn q  4 jon q   1   joan q  5 june    6   null    null mary w  7   null    null marie w 8   null    null matt h  9   null    null 

query output 2

name    stdname jon q   jon q john q  jon q john q  jon q jonn q  jon q joan q  jon q june    june mary w  mary w marie w marie w matt h  matt h 

there dragons

over on superuser, talked experience matching people. in section, i'll list things aware of.

speed

as part of matching, hooray in have birthday augment match process. propose generate match based exclusively on birthdate first. exact match , 1 that, proper index, sql server able include/exclude rows. because you're going need it. tsql implementation dog slow. i've been running equivalent match against dataset of 28k names (names had been listed conference attendees). there ought overlap there , while did fill @src data, table variable that implies it's been running 15 minutes , still hasn't completed.

it's slow number of reasons things jumped out @ me looping , string manipulation in functions. not sql server shines. if have need lot of this, might idea convert them clr methods @ least can leverage strength of .net libraries of manipulations.

one of matches used use double metaphone , generate pair of possible phonetic interpretations of name. instead of computing every time, compute once , store alongside name. speed of matching. unfortunately, doesn't jw lends breaking down that.

look @ iterating too. we'd first try algs knew fast. 'john' = 'john' there's no need pull out big guns we'd try first pass of straight name checks. if didn't find match, we'd try harder. hope taking various swipes @ matching we'd low hanging fruit fast possible , worry harder matches later.

names

in su answer , in code comments, mention nicknames. bill , billy going match. billy, liam , william not going match though may same person. might want @ list provide translation between nickname , full name. after running set of matches on supplied name, maybe we'd try looking match based on possible root name.

obviously, there draw backs approach. example, grandfather-in-law max. max. not maximilian, maximus or other things might thing.

your supplied names it's first , last concatenated together. future readers, if ever have opportunity capture individual portions of name, please so. there products out there split names , try match them against directories try , guess whether first/middle name or surname have people "robar mike". if saw name there, you'd think robar last name , you'd pronounce "robber." instead, robar (say french accent) first name , mike last name. @ rate, think you'll have better matching experience if can split first , last out separate fields , match individual pieces together. exact last name match plus partial first name match might suffice, in cases legally "franklin roosevelt" , have candidate of "f. roosevelt" perhaps have rule initial letter can match. or don't.

noise - referenced in jw post , answer, strip out crap (punctuation, stop words, etc) matching purposes. watch out honorific tites (phd, jd, etc) , generationals (ii, iii, jr, sr). our rule candidate with/without generational match 1 in opposite state (bob jones jr == bob jones) or match generation (bob jones sr = bob jones sr) you'd never want match if both records supplied them , conflicting (bob jones sr != bob jones jr).

case sensitivity, check database , tempdb make sure aren't making case sensitive matches. , if are, convert upper or lower purposes of matching don't ever throw supplied casing away. luck trying determine whether latessa should latessa, latessa or else.

my query coming on hour's worth of processing no rows returned i'm going kill , turn in. best of luck, happy matching.


Comments

Popular posts from this blog

java - Jmockit String final length method mocking Issue -

What is the difference between data design and data model(ERD) -

ios - Can NSManagedObject conform to NSCoding -