Normalize the names in a huge table using UTL_MATCH

Hello

I have a large table (350 million records) with a "full name" column

This column has a few typos, so I have to 'normalise' the data (only for this column) using UTL_MATCH.JARO_WINKLER_SIMILARITY.

I did some tests with a small table, and it works to show the similar names:

SELECT b.SID, b.name FROM typotable a, typotable b WHERE utl_match.jaro_winkler_similarity(a.name, b.name) BETWEEN 85 AND 99 AND a.rowid > b.rowid;


But:


(1) The test table was small; running this code directly on the 350-million-row accounts table takes ages. What can be done about it?


(2) This only shows the similar names. How can I update the table by searching for similarities and choosing one of them as the single value for each name?




Thank you

1590733 wrote:

Yes, I get your point. The thing is that there are no "correct" names available and the original table is huge, so this is what I thought:

- Create a secondary NAMES table with unique names. These names would be generated by matching similar values to one of them (always the same one, even if it is not the best fit). This would be the equivalent of your table of 'correct' values.

- Run the cleaning procedure to update the records.

How can I create this secondary NAMES table? (The 'genre' column does not matter at all; only the 'name' must be fixed.)

Thanks for your help

Well, you need to determine the logic that would pick one of the incorrect names over the other. As it stands, you can easily get two incorrect values with the same match value. But you must also consider what constitutes a 'group' of values from which you can pick the best one. Using the match alone is not enough to create groups.

Example:

SQL> ed
Wrote file afiedt.buf

select a.fname as fname1, b.fname as fname2
      ,utl_match.jaro_winkler_similarity(a.fname, b.fname) as match
from   typotable a
       join typotable b on (a.fname != b.fname)
where  utl_match.jaro_winkler_similarity(a.fname, b.fname) >= 85
order by 1, 3 desc
SQL> /

FNAME1     FNAME2          MATCH
---------- ---------- ----------
FROCEN     FROZEN             92
FROZEN     FROCEN             92
FROZEN     FROCEN             92
FROZEN     FROZEN             92
JELLY      FROZIN             93
JELLY      FROCEN             92
FROZEN     FROZEN             92
FROZEN     FROZIN             93
WHIPLASH   WIPLASH            96
WHIPLASH   WIPLASH            96

10 rows selected.

As you can see, FROCEN, for example, has two possible variants, both with a match of 92.  The same goes for the others.

However, you could start hacking things around (and it really is a hack) to get something like:

SQL> ed
Wrote file afiedt.buf

with t as (
  select a.fname as fname1, b.fname as fname2
        ,utl_match.jaro_winkler_similarity(a.fname, b.fname) as match
  from   typotable a
         join typotable b on (a.fname != b.fname)
  where  utl_match.jaro_winkler_similarity(a.fname, b.fname) >= 85
)
,c as (
  select fname1, greatest(fname1, fname2) as fname2, match
        ,(select count(*)
          from   t t2
          where  t2.fname2 = t.fname2
          and    t2.fname1 != t.fname1
         ) as cnt
  from   t
)
,r as (
  select fname1, fname2, match, cnt
        ,row_number() over (partition by fname1 order by cnt desc, match desc) as rn
  from   c
)
select fname1, fname2
from   r
where  rn = 1
order by 1
SQL> /

FNAME1     FNAME2
---------- ----------
FROZEN     FROCEN
FROZEN     FROZEN
FROZEN     FROZEN
FROZIN     FROZIN
WHIPLASH   WIPLASH
WIPLASH    WIPLASH

6 rows selected.

But whether it will work in all circumstances depends on your data.
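One possible way to approach both original questions, as a rough sketch only: compare distinct names rather than all 350 million rows, only compare names that share a cheap "blocking" key, and then update the big table from the resulting mapping. The table names ACCOUNTS, NAME_FREQ and NAME_MAP, the column FULL_NAME, the blocking rule (same first letter, length difference of at most 2) and the "most frequent spelling wins" rule are all assumptions made for illustration; none of them come from the original table.

```sql
-- Assumption: the big table is ACCOUNTS(FULL_NAME); every other name is hypothetical.

-- 1. Work on distinct names only, remembering how often each one occurs.
create table name_freq nologging as
select full_name, count(*) as cnt
from   accounts
group  by full_name;

-- 2. Compare names only within a block (same first letter, similar length)
--    and map each name to the most frequent spelling among its similar names.
--    A name is also compared with itself, so a name that is already the most
--    frequent spelling in its neighbourhood is left out of the mapping.
create table name_map nologging as
select full_name, canonical_name
from  (select a.full_name
             ,b.full_name as canonical_name
             ,row_number() over (partition by a.full_name
                                 order by b.cnt desc, b.full_name) as rn
       from   name_freq a
              join name_freq b
                on  substr(b.full_name, 1, 1) = substr(a.full_name, 1, 1)
               and  abs(length(b.full_name) - length(a.full_name)) <= 2
               and  utl_match.jaro_winkler_similarity(a.full_name, b.full_name) >= 85)
where  rn = 1
and    canonical_name != full_name;

-- 3. Apply the mapping to the big table.
update accounts t
set    t.full_name = (select m.canonical_name
                      from   name_map m
                      where  m.full_name = t.full_name)
where  exists (select null
               from   name_map m
               where  m.full_name = t.full_name);
```

Like the hack above, this picks a replacement per name rather than building true groups, so chains are possible (A maps to B while B maps to C). It is worth reviewing NAME_MAP before running the update.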

Tags: Database

Similar Questions

  • How to change the name of column in table effectively

    Hello

    I use JDev 11.1.1.4.0.

    My JSF page is built with 3 or 4 EOs, VOs and LOVs. However, one of the EO tables has had a column renamed. The name change is important for the project's standards.

    I remember having a lot of trouble changing a column name before (I had to start over), so I would like the experts' take on how to do it without too much impact.

    This particular column is in an EO and a VO, and lives as a text field on a JSF page.

    All comments or suggestions will be greatly appreciated.

    Thank you

    Bones Jones

    Indeed, the only thing you have to change is the column name of the specific entity object attribute (the database column name in the entity attribute dialog box). Now, if you used the synchronize with database feature, you will need to go through a full refactoring process. In that case, use the Refactor menu in JDeveloper. See if anything in this post can help you: http://jdeveloperfaq.blogspot.com/2010/04/faq-20-how-to-refactor-adf-components.html

  • How to change the names of slide in Table of contents

    Hello

    How can I change the names of the slides in my table of contents? Slide 1, 2, 3, etc. is not very informative. I went into the slide properties and added labels, but that did not update the slide names in the table of contents.

    Any help would be appreciated. Thank you.

    Hello

    Perhaps try to click Reset TOC?

    Click on project > Table of contents.

    Click Reset TOC.


    See you soon... Rick


  • How to get a list of the names of a query table?

    Hello

    I use Oracle 10g. I have about 100 SQL queries stored in a table. I would like to know if there is an easy way to retrieve the source tables of each query.
    For example:
    I have a query "SELECT * FROM Table1 t1 INNER JOIN Table2 t2 ON t1.col1 = t2.col1".
    From this query, I would like to automatically get the list of tables:
    Table1
    Table2

    Thanks in advance for your collaboration.

    Best regards
    Beroetz

    From this query, I would like to automatically get the list of tables:

    Do an EXPLAIN PLAN on the query.

    The object names will be in the OBJECT_NAME column of your PLAN_TABLE.

    But an object name can be a table or an index, so you'll need to join to USER_OBJECTS on this name to see whether it is a table or an index.

    You will also need to take into account the cases where a query can be satisfied using only the index. You can always get the table name by looking it up in USER_INDEXES.
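The steps in this answer could look roughly like the following sketch. It assumes a PLAN_TABLE is accessible and that all referenced objects live in the current schema; the statement id 'Q1' is an arbitrary label chosen for the example.

```sql
-- Explain the stored query text under a known statement id.
explain plan set statement_id = 'Q1' for
select * from table1 t1 inner join table2 t2 on t1.col1 = t2.col1;

-- Resolve every object in the plan to a table name:
-- tables are reported as-is, indexes are mapped back to their table.
select distinct
       case o.object_type
         when 'INDEX' then i.table_name
         else p.object_name
       end as table_name
from   plan_table p
       join user_objects o
         on o.object_name = p.object_name
       left join user_indexes i
         on i.index_name = p.object_name
where  p.statement_id = 'Q1'
and    o.object_type in ('TABLE', 'INDEX');
```

For the 100 stored queries, the same two statements could be run in a PL/SQL loop with EXECUTE IMMEDIATE, using a different statement id per query.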

  • How the names of variables and units used in the binary output file

    My colleague gives me binary files (*.dat) generated by LabVIEW. There are more than 60 variables (columns) in the binary output file. I need to know the names and units of the variables, which I think he has already configured in LabVIEW. Is there a way for him to produce a file that contains the name and unit of each variable, so that I know what the binary file contains? He can create an equivalent ASCII file with a header giving the variable names, but it does not list the unit of each variable.

    As you can tell I'm not a user of LabView, so I apologize if this question makes no sense.

    Hi KE,

    an ASCII file (probably in CSV format) is just text and contains only the data that was (intentionally) written to it. There is no special function to include units or anything else!

    Your colleague must save that information himself, the same way he writes the names and values...

    (When writing to text files, he could use WriteTextFile, FormatIntoFile, WriteToSpreadsheetFile; even WriteBinaryFile could serve...)

  • How to reference the names of columns, if you use select *.

    Hello

    How do I reference column names to output the data when I use select * and do not know the column names (or the number of columns) in advance?

    I can get the column names into other variables, though. I am new to CF, so the question may be stupid.

    Getting the column names:

    <cfquery datasource="RTW_ORA" name="cn">
    SELECT COLUMN_NAME
    FROM ALL_COL_COMMENTS
    WHERE TABLE_NAME = '#meas#'
    </cfquery>

    Getting the data:

    <cfquery datasource="RTW_ORA" name="cd">
    SELECT *
    FROM #meas#
    </cfquery>

    How do I output all the data?

    Any help would be much appreciated!

    Thank you

    Tushar Saxena


    Your question is not stupid. You can use the concept of a query of queries, together with the cfquery attributes name and result.

    SELECT *
    FROM #meas#

    Column names: #column_names#

    Number of columns: #no_of_columns#

    SELECT #column_names#
    FROM cd

    SQL of the query of queries: #resQoQ.sql#

    Query:

    #column#: #cd[column][currentrow]#

    QoQ:

    #column#: #QoQ[column][currentrow]#

  • I want to loop through the data from two different tables using for loop where the query should be replaced at runtime, please help me

    I have data in two tables with a similar column structure. I want to loop through the data in one of these two tables based on some condition, with the query for the loop chosen at runtime. An example is given below; please help me.

    create table ab (a number, b varchar2(20));

    insert into ab
    select rownum, rownum || 'sample'
    from dual
    connect by level <= 10;

    create table bc (a number, b varchar2(20));

    insert into bc
    select rownum + 1, rownum + 1 || 'sample'
    from dual
    connect by level <= 10;

    declare
      l_statement varchar2(2000);
      bool boolean;
    begin
      bool := true;
      if bool then
        l_statement := 'select * from ab';
      else
        l_statement := 'select * from bc';
      end if;
      for i in execute immediate l_statement  -- something like this, but I don't know how
      loop
        dbms_output.put_line(i.a);
      end loop;
    end;

    Something like that, but this is not a working piece of code.

    Try this and adapt according to your needs:

    declare
      l_statement varchar2(2000);
      c           sys_refcursor;
      l_a         number;
      l_b         varchar2(20);
      bool        boolean;
    begin
      bool := true;
      -- choose the query text at runtime
      if bool then
        l_statement := 'select a, b from ab';
      else
        l_statement := 'select a, b from bc';
      end if;
      --
      -- open a ref cursor on the dynamic statement
      open c for l_statement;
      --
      loop
        fetch c into l_a, l_b;
        exit when c%notfound;
        dbms_output.put_line(l_a || ' - ' || l_b);
      end loop;
      close c;
    end;
    /

  • On facebook I can't watch my friends with the name of Tracey I can using my phone app I can with Chrome etc but not on my PC Tower firfox

    On Facebook I can't see my friends named Tracey. I can using my phone app and with Chrome etc., but not using Firefox on my PC tower.

    Hello

    To better help you with your question, please provide us with a screenshot. If you need help creating a screenshot, please see How to make a screenshot of my problem?

    Once you have done so, attach the saved screenshot file to your forum post by clicking the Browse... button under the Post your reply box. This will help us to visualize the problem.

    Thank you!

  • Best way to update all the lines of a huge table

    Hi all

    I'm on Oracle 11.2 Enterprise edition.

    I was asked a question: "Suppose there is a large table, say 3 GB in size. We need to update all rows in this table (e.g. col3 = col3 * 2). What is the best way to do this?"

    My answer was that there are 2 possible ways, depending on the number of indexes on the table:

    (1) If there are many (or large) indexes on the table OR the database is in FORCE LOGGING mode (due to log shipping), I just run an UPDATE command on the table and update all records. This will generate a lot of redo and undo, but we cannot do anything about the indexes.

    (2) If there are not many indexes, I'll create an empty table with the same structure (create table t2 nologging as select * from t1 where 1 = 2). Then, using an INSERT ... SELECT command with "col3 * 2" instead of col3, I insert all the data into the new table with the APPEND hint. Then I create all the indexes on the new table and finally rename T2 to T1.

    With the first solution, I avoid recreating the indexes and save that I/O. In the forced-logging case, the first solution makes more sense anyway.

    The second solution makes sense if we have the freedom to create objects in NOLOGGING mode.

    What do you experts think? Am I thinking in the right direction, or not?
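As a rough sketch of the second approach described above: the column list (col1, col2, col3), the index name and the T1_OLD name are assumptions made for illustration, and the old table has to be renamed or dropped before T2 can take its name.

```sql
-- Assumption: T1 has columns col1, col2, col3; the index name is hypothetical.

-- Empty copy of the structure, created without logging.
create table t2 nologging as
select * from t1 where 1 = 2;

-- Direct-path load with the new value computed on the way in.
insert /*+ append */ into t2 (col1, col2, col3)
select col1, col2, col3 * 2
from   t1;

commit;

-- Recreate the indexes on the new table.
create index t2_col1_idx on t2 (col1) nologging;

-- Swap the tables.
alter table t1 rename to t1_old;
alter table t2 rename to t1;
```

Grants, constraints and triggers on T1 are not carried over by this sketch and would need to be recreated. In a FORCE LOGGING database the NOLOGGING attribute is ignored, which is why the first approach was preferred for that case.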

    Thanks in advance

    Hello

    This should probably help you.

    Kind regards

    Suntrupth

  • the names of the plots on the chart and use these channels in the menu of the ring

    Hello everyone!

    I have a 1D array of clusters of 2 elements: one is a number and the other a string. These strings contain the names of the signals. I connected this 1D array to Reshape Array with a dimension of 10. Then the resulting array is connected to Array To Cluster. This cluster is then unbundled to get the names of the plots. The problem is that I don't want these 9 Unbundle blocks just to get the plot names. Is there any way I can do it without using Unbundle 9 times? I thought of using a for loop or a while loop, but I need some suggestions.

    So I have two questions:

    How do I get the names of the plots without using Unbundle so many times?

    Second, how do I display the names of these plots in my menu ring?

    I must have missed something; I didn't see any large cluster in your diagram. Change this large cluster into an array, because it contains many elements of the same type. Then proceed as in the attachment.

  • How to get the name of the particular index table option.

    Hello

    Can anybody tell me how to get the name of a particular array inside an array? I have an array within an array, and I must compare the name of the inner array with a particular key. Here is the array:

    myArray = Array (@43b1e09)
    [0] = Object (@42b33f9)
    Testing_1 = Array (@4428821)
    [0] = Object (@43adc19)
    choice_id = "0"
    delete = "N"
    DownloadURL = "xyz"
    selected = "Y"
    translation = "2_486"
    length = 1
    length = "N"
    Editable = "Y"
    field_id = "388"
    LanguageLink = "Y"
    linked_definition_id = null
    multiple values = "N"
    name = "Photo"
    otheroption = "N"
    photovitlink = Object (@43ad0d9)
    required = "N"
    step = "1"
    translation = "Photo"
    visible = "Y"
    [1] = Object (@43ad5d9)
    [2] = Object (@4490089)

    Here is the structure of the array I get from the server side. I named the result array myArray. It has several children, each an Object containing an array (Testing_1 in the first case). I must get the name of this Testing_1 array and compare it with my sort key to perform an operation. But I am unable to get the name of this Testing_1 array (since it is dynamic, the name sometimes changes). Can anybody guide me on how to get the name of this array?



    Thanks and greetings

    Vineet Sharma

    Hi Vineet Osho,

    You can iterate over your object with a loop and get the name of the array, as below:

    for each (var obj:Object in myArray)
    {
        // walk the property names of each child object
        for (var str:String in obj)
        {
            if (obj[str] is Array)
            {
                // str is the name of the array property, e.g. Testing_1
                var arrayName:String = str;
            }
        }
    }

    Thank you

    Jean Claude

  • How to get the names of all the views on a table

    Hi all
    I am trying to get the names of all the views on a table. I tried to use USER_VIEWS, but it has no column specifying the table name.
    Can someone please tell me how I can get the names of all the views on a table?

    Thank you

    You will need to join USER_VIEWS with USER_DEPENDENCIES to list the views that depend on a particular table.
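A minimal sketch of that join, assuming the table lives in the current schema and is called EMP (a hypothetical name, written in upper case as stored in the data dictionary):

```sql
select v.view_name
from   user_views v
       join user_dependencies d
         on  d.name = v.view_name
         and d.type = 'VIEW'
where  d.referenced_type  = 'TABLE'
and    d.referenced_name  = 'EMP'    -- hypothetical table name
and    d.referenced_owner = user;
```

This lists only views that reference the table directly; a view built on top of another view appears as a dependency of that view, not of the table.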

  • Get-VM: VM with name "xyz" was not found using the specified filters.

    Hello everyone!

    I have a script that reads a CSV with multiple host names, connects to vCenter, and should get the host name, IP and port group information to generate an output CSV file.
    It happens that in several cases the VM does not exist in vCenter and I get the error below:

    Get-VM : 04/07/2016 14:41:58    Get-VM    VM with name 'pxl1sso00008' was not found using the specified filters.

    At C:\Users\f3135606\Desktop\vmTeste1.ps1:25 char:31

    + foreach ($vmName in $vmList) {get-vm $vmName| Select Name, @{N="Network"; e={ $_ ...

    + ~~~~~~~~~~~~~~
    + CategoryInfo : ObjectNotFound: (:) [Get-VM], VimException
    + FullyQualifiedErrorId : Core_OutputHelper_WriteNotFoundError,VMware.VimAutomation.ViCore.Cmdlets.Commands.GetVM



    How can I take the information from an input file, query vCenter and, if the result is positive, write it to the file, ignoring errors? Ideally, if the machine is not found, it would simply create a line in the output file with only the hostname and the rest blank.

    The .ps1 file follows:



    $vmlist = Get-Content C:\vmnames.csv
    
    if (!(Get-PSSnapin -Name VMware.VimAutomation.Core -ErrorAction SilentlyContinue))
    {
    Add-PSSnapin VMware*
    Set-PowerCLIConfiguration -DisplayDeprecationWarnings $false -DefaultVIServerMode multiple -InvalidCertificateAction Ignore -Scope Session -ProxyPolicy NoProxy -Confirm:$false | Out-Null  
    [void](Get-PSSnapin VMWare.VimAutomation.Core -ErrorVariable getVmwareSnapinErr 2> $null)
    if ($getVmwareSnapinErr.Count -gt 0) {    Add-PSSnapin VMware.VimAutomation.Core }
    }
        
    $VCconn = Connect-VIServer $vCenter -User $vUsuario -Password $vPass > $null
         
    foreach ($vmName in $vmList) {get-vm $vmName| Select Name, @{N="Network"; e={ $_ | get-networkadapter|Select-Object @{N="Network";E={$_.NetworkName}}} }, @{N="IP Address";E={@($_.guest.IPAddress[0])}}|Export-Csv –path c:\scripts\vlans.csv –NoTypeInformation}
    

    Laurent,

    See below... this should get you what you want. If the virtual machine is not found, only the VM name is written to the output.

    $vmlist = Get-Content C:\vmnames.csv  
    
    if (!(Get-PSSnapin -Name VMware.VimAutomation.Core -ErrorAction SilentlyContinue))  {
        Add-PSSnapin VMware*
        Set-PowerCLIConfiguration -DisplayDeprecationWarnings $false -DefaultVIServerMode multiple -InvalidCertificateAction Ignore -Scope Session -ProxyPolicy NoProxy -Confirm:$false | Out-Null
        [void](Get-PSSnapin VMWare.VimAutomation.Core -ErrorVariable getVmwareSnapinErr 2> $null)
        if ($getVmwareSnapinErr.Count -gt 0) {
            Add-PSSnapin VMware.VimAutomation.Core
        }
    }
    
    $VCconn = Connect-VIServer $vCenter -User $vUsuario -Password $vPass > $null
    $arrVMInfo = @()
    
    foreach ($vmName in $vmList) {
        $vm = get-vm $vmName -ErrorAction SilentlyContinue -ErrorVariable VMError | Select Name, @{N="Network";E={ $_ | get-networkadapter | Select-Object @{N="Network";E={$_.NetworkName}}} }, @{N="IPAddress";E={@($_.guest.IPAddress[0])}}
        if ($vm -eq $null) {
            $arrVMInfo += New-Object PSObject -Property @{ `
                Name=$vmName `
            }
        }
        else {
            $arrVMInfo += New-Object PSObject -Property @{ `
                "Name"=$vm.name; `
                "Network"=$vm.Network.Network; `
                "IP Address"=$vm.IPAddress `
            }
        }
    }
    $arrVMInfo | Select Name, Network, "IP Address" | Export-Csv "c:\scripts\vlans.csv" -NoTypeInformation
    
  • Live Mail does not keep the "name <e-addr>" format of an e-mail address if the e-addr is not already in the list of contacts

    After pressing Ctrl-F to save the draft email being composed, if the recipients in the To:/Cc:/Bcc: fields are not already in the list of contacts, Live Mail removes the "name <e-addr>" format, which makes the email undeliverable due to unknown recipients. BTW, is there a way to make Live Mail automatically save the recipient to the list of contacts when the "name <e-addr>" format is used in the To: field?

    See: a question of Hotmail or Windows Live Mail?
    http://answers.Microsoft.com/en-us/Windows/Forum/Windows_7-networking/have-a-Windows-Live-Mail-or-Hotmail-question/8bd31c48-d1a7-49D6-a08c-9069aaeba2e5

  • The name of this font?

    Can someone please tell me the name of this font, which I used years ago? I am working on homework for my graphic design class and want to use the same font, but I do not remember its name and it is not in my favorites or downloaded list.

    Panda Art.jpg

    KZ BLADERUNNER 1 fonts | WhatFontis.com

    Fenja
