tzzz


Levenshtein Edit Distance algorithm implemented as a native function for MySQL 5
December 2, 2005, 7:26 pm
Filed under: Edit distance, fuzzy search, Levenshtein, MySQL

Save This Page

Levenshtein Edit Distance algorithm implemented as a native function for MySQL 5 by Matuszczyk and Runeborg, Dec 2 2005

INTRODUCTION

The Levenshtein Distance (LD) is a measurement of the difference between two strings. It is also commonly reffered to as the Edit Distance. The distance is a positive integer where the value represents the number of deletions, insertions or substitutions that need to be made to the source string original of length lo in order to transform it into the target string target of length lt. A more thorough description of how the original algorithm works is available here.

In this implementation of the algorithm, the distance is calculated using an array of linear size (lo + lt).

USAGE

EDIT_DISTANCE(original, target, max_distance)

Returns:
 - int max_distance + 1 if LD > max_distance
 - int LD between original and target on success
 - null on error

INSTALLATION

To install this function on your MySQL server follow the steps below. A precondition is that you have the full source code for MySQL 5 on your computer. You can alo download the Complete README and source code.

Step 1:

:: << In sql/lex.h >>
:: Add the following line to "static SYMBOL sql_functions[]"
{ "EDIT_DISTANCE",    F_SYM(FUNC_ARG3),0,CREATE_FUNC(create_func_edit_distance)},

Step 2:

:: << In sql/item_create.h >>
:: Add the following
Item *create_func_edit_distance(Item* a, Item *b, Item *c);

Step 3:

:: << In sql/item_create.cc >>
Item *create_func_edit_distance(Item* a, Item* b, Item* c)
{
  return new Item_func_edit_distance(a,b,c);
}

Step 4:

:: << In sql/item_func.h >>
:: Add the following
class Item_func_edit_distance :public Item_int_func
{
  String value1, value2;
public:
  Item_func_edit_distance(Item *a, Item *b, Item *c) :Item_int_func(a, b, c) {}
  longlong val_int();
  const char *func_name() const { return "edit_distance"; }
  void fix_length_and_dec() { max_length=10; }
};

Step 5:

:: << In sql/item_func.cc >>
:: Add the following include
#include  <memory.h>

:: Add the following further down among the other functions
longlong Item_func_edit_distance::val_int()
{

/*
   * Levenshtein Edit Distance algorithm in linear space
   *
   * The Levenshtein Distance (LD) is a measurement of the difference
   * between two strings. It is also commonly reffered to as the Edit
   * Distance. The distance is a positive integer where the value
   * represents the number of deletions, insertions or substitutions
   * that need to be made to the source string original of length lo
   * in order to transform it into the target string target of length
   * lt. A more thorough description of how the original algorithm
   * works is available here.
   *
   * In this implementation of the algorithm, the distance is
   * calculated using an array of linear size (lo + lt).
   *
   * (C) Copyright 2005, Tomasz Matuszczyk
   * (C) Copyright 2005, Christian Runeborg
   *
   * Based on Craig Hendersons implementation of edit_distance for Boost.
   * Documentation and original implementation available at:
   * http://cvs.sourceforge.net/viewcvs.py/boost-sandbox/boost-sandbox/boost/sequence_algo/
   *
   * Permission to use, copy, modify, distribute and sell this software
   * and its documentation for any purpose is hereby granted without fee,
   * provided that the above copyright notice appears in all copies and
   * that both that copyright notice and this permission notice appear
   * in supporting documentation.  The author makes no representations
   * about the suitability of this software for any purpose.  It is
   * provided "as is" without express or implied warranty.
   */

DBUG_ASSERT(fixed == 1);
  String *original = args[0]->val_str(&value1);
  String *compare = args[1]->val_str(&value2);
  int max_distance = (int) args[2]->val_int();

// Check for nulls and return null
  if (!original || !compare)
  {
    null_value=1;
    return 0;
  }

null_value=0;

// Get string lengths
  int original_length = original->length();
  int compare_length = compare->length();

// If one of the strings is empty return the other ones length as distance
  if (original_length == 0) return (longlong) compare_length;
  if (compare_length == 0) return (longlong) original_length;

// Return max_distance + 1 if length difference between strings is higher than max_distance
  if (abs(original_length - compare_length) > max_distance) return (longlong) (++max_distance);

// Iterators
  char *original_start = original->c_ptr();
  char *original_end = original_start + original_length;
  char *compare_start = compare->c_ptr();
  char *compare_end = compare_start + compare_length;

// Measure the distance

int size_first = (int)(original_end - original_start);
  int size_second = (int)(compare_end - compare_start);
  int array_size = (size_first + 1) << 1;

int *pA = new int[array_size];
  int *pB = pA + size_first + 1;
  memset(pA, 0, sizeof(int)*array_size);

int col;
  for(col = 1; col <= size_first; ++col)
    pA[col] = col;

char *p2 = compare_start;
  for(int row = 1; row <= (size_second + 1); ++row, ++p2)
  {
    char *p1 = original_start;
	for(col = 1; col <= size_first; ++col, ++p1)
	{
	  pB[0] = row;
	  pB[col] = min( min( pB[col-1] + 1, pA[col] +1), pA[col-1] + ((*p1 == *p2) ? 0 : 1));
	}

int *pT = pA;
	pA = pB;
	pB = pT;
  }

int result = pB[size_first];
  delete [] pA;

return (longlong)result;
}

TESTING

This implementation has been compiled and tested with the MySQL 5.0.16 source and Apples gcc compiler (powerpc-apple-darwin8-gcc-4.0.0 (GCC) 4.0.0 20041026 (Apple Computer, Inc. build 4061).)

DISCLAIMER

(C) Copyright 2005, Tomasz Matuszczyk

(C) Copyright 2005, Christian Runeborg

Based on Craig Hendersons implementation of edit_distance for Boost. Documentation and original implementation available at: http://cvs.sourceforge.net/viewcvs.py/boost-sandbox/boost-sandbox/boost/sequence _algo/

Permission to use, copy, modify, distribute and sell this software and its documentation for any purpose is hereby granted without fee, provided that the above copyright notice appears in all copies and that both that copyright notice and this permission notice appear in supporting documentation. The author makes no representations about the suitability of this software for any purpose. It is provided “as is” without express or implied warranty.