Amazon

Saturday 18 June 2011

Texts

Searching

  • Aho-Corasick. Search in a text by building a table from words.
  • Bitap (or shift-or, shift-and, Baeza-Yates-Gonnet). Fuzzy string searching algorithm developed by Udi Manber and Sun Wu.
  • Boyer-Moore string search. Search in text by skipping sub-string not containing letters in the searched input.
  • Knuth-Morris-Pratt. Build a table when searching to skip sub-string.
  • Rabin-Karp string search. Use hashing for multiple searches.
  • Longest common subsequence problem. Haskell's algorithm. Of two sequences.
  • Longest increasing subsequence problem. Of two sequences. It also reduces to find the longest path in a directed acyclic graph.
  • Shortest common supersequence. Of two sequences.
  • Horspool. Simplification of the Boyer-Moore algorithm. O(mn).

Approximate matching

  • Levenshtein distance (or edit distance). Minimum number of operations (insertion, deletion, replacement) needed to transform one string into the other.
  • Soundex. Phonetic algorithm for indexing words by their sound (in English).
  • Metaphone. Indexing words by their sound (in English).
  • NYSIIS. (New York State Identification and Intelligence System). Phonetic algorithm that improves soundex.

Word processing

  • Latent Dirichlet Allocation (LDA). Analysis of documents to associate the content with a topic. Used by search engines.
  • Latent Semantic Indexing (LSI). Automation of methods to attach a text to a topic from the words that occur commonly in this context.
  • Stemming. A method of reducing words to their stem, or root form.