Math Technology

Tracing the Limits of Computation

An anonymous reader writes: For more than 40 years, researchers had been trying to find a better way to compare two arbitrary strings of characters, such as the long strings of chemical letters within DNA molecules. The most widely used algorithm is slow and not all that clever: It proceeds step-by-step down the two lists, comparing values at each step. If a better method to calculate this "edit distance" could be found, researchers would be able to quickly compare full genomes or large data sets, and computer scientists would have a powerful new tool with which they could attempt to solve additional problems in the field.

Yet in a paper presented at the ACM Symposium on Theory of Computing, two researchers from the Massachusetts Institute of Technology put forth a mathematical proof that the current best algorithm was "optimal" — in other words, that finding a more efficient way to compute edit distance was mathematically impossible. But researchers aren't quite ready to record the time of death. One significant loophole remains. The impossibility result is only true if another, famously unproven statement called the strong exponential time hypothesis (SETH) is also true.
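
For the curious, the "slow and not all that clever" method is in practice the classic quadratic-time dynamic-programming recurrence. Here is a minimal sketch in Python (illustrative names and a tiny usage line, not code from the paper):

    # Classic Wagner-Fischer DP: dp[i][j] is the edit distance between
    # the first i characters of a and the first j characters of b.
    def edit_distance(a: str, b: str) -> int:
        m, n = len(a), len(b)
        dp = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            dp[i][0] = i                      # delete all of a[:i]
        for j in range(n + 1):
            dp[0][j] = j                      # insert all of b[:j]
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                cost = 0 if a[i - 1] == b[j - 1] else 1
                dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                               dp[i][j - 1] + 1,         # insertion
                               dp[i - 1][j - 1] + cost)  # substitution/match
        return dp[m][n]

    print(edit_distance("book", "back"))  # 2

The two nested loops are exactly where the quadratic running time comes from, and that quadratic cost is what the MIT result says cannot be substantially improved, assuming SETH.
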
  • by Ihlosi ( 895663 ) on Wednesday September 30, 2015 @04:47AM (#50625533)
    The algorithm may not be very clever, but it scales almost perfectly with the number of cores (as long as you don't have more cores than characters in the string).

    Comparing strings is a case for specialized hardware, not for more sophisticated algorithms.

    • by Required Snark ( 1702878 ) on Wednesday September 30, 2015 @05:00AM (#50625563)
      The formal description of the edit distance [wikipedia.org] is different from that described in the article. This is the one used in some cases of DNA or text comparison.

      Insertion of a single symbol. If a = uv, then inserting the symbol x produces uxv. This can also be denoted e->x, using e to denote the empty string.

      Deletion of a single symbol changes uxv to uv (x->e).

      Substitution of a single symbol x for a symbol y ≠ x changes uxv to uyv (x->y).

      It's not just about changing a symbol, it also includes adding and deleting symbols. That makes simple parallelization impossible.

      It's not clear if the mathematical proof cited in the article is the simple one of only changing symbols, or the more complex case of symbol addition and deletion. From the article, it appears that they are talking about the simple case, which makes the more complex algorithm even more intractable.

      • by owl57 ( 3467573 )

        Definition from TFA: For any two sequences x, y, EDIT(x, y) is equal to the minimum, over all sequences z, of the number of deletions and substitutions needed to transform x into z and y into z. This is equivalent to the one you cited: insertion into x is the same thing as deletion from y.

      • The formal description of the edit distance [wikipedia.org] is different from that described in the article. This is the one used in some cases of DNA or text comparison.

        Insertion of a single symbol. If a = uv, then inserting the symbol x produces uxv. This can also be denoted e->x, using e to denote the empty string.

        Deletion of a single symbol changes uxv to uv (x->e).

        Substitution of a single symbol x for a symbol y ≠ x changes uxv to uyv (x->y).

        It's not just about changing a symbol, it also includes adding and deleting symbols. That makes simple parallelization impossible.

        It's not clear if the mathematical proof cited in the article is the simple one of only changing symbols, or the more complex case of symbol addition and deletion. From the article, it appears that they are talking about the simple case, which makes the more complex algorithm even more intractable.

        That does make the simple hardware a little trickier, although for specific purposes there are some tricks you may be able to design for in practice that become more efficient as the size of the insert grows. (An entire gene, for example, would be easier per gene to change than a single base pair.)

        Basically you can make a list of the changes. If you know you inserted a big object at place X, you can dynamically generate any requested index without having to rewrite the whole really long string, for example.

    • No, not really. It is easily parallelized, yes, but you don't really improve the big-O complexity the way you do with sorting algorithms.
      String comparison will always be O(n).
      Now, depending on the data, you may be able to speed it up: say your string contains particular patterns that you can group into a single letter, effectively compressing the string and comparing the two compressed versions; or, with DNA, you can pack each letter into a couple of bits so you are searching more efficiently. However, you are still running in linear time.

      Th
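
      A minimal sketch of the packing idea above, assuming a plain ACGT alphabet: with only four letters each base fits in 2 bits, and packed blocks can then be compared as integers. The helper names are made up for illustration.

          # Pack an ACGT string into an integer, 2 bits per base, then compare
          # two packed sequences with integer operations.
          CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}

          def pack(seq: str) -> int:
              packed = 0
              for base in seq:
                  packed = (packed << 2) | CODE[base]
              return packed

          a = pack("ACGTACGT")
          b = pack("ACGTACGA")
          print(a == b)      # False: still an exact, linear-time comparison
          print(bin(a ^ b))  # XOR exposes the positions that differ

      As the parent says, this only shrinks the constant factor; the comparison is still linear in the length of the sequences.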

    • The human brain is known to take shortcuts when doing string comparisons, so why can't a computer, to a certain degree? If memory serves (and I apologize if I'm off, it's the end of a long day), when reading English at least, the brain reads the first letter and the last letter and approximates the length of the word to make the connection to the word stored in memory. This apparently works up to a character limit, then the brain chunks the words, so compliment and complement might get confused as cplt+length for b

      • by Anonymous Coward

        Read and understand the problem and your answer will become trivial. The problem you present in your post has nothing to do with the string comparison algorithm in general (or the sequence alignment problem in particular).

      • by Sique ( 173459 )
        Especially for DNA comparison, this shortcut won't work. Worse, it is antithetical to the task at hand. We want to know the edit distance, or in DNA speak, how many mutations we need to get from DNA1 to DNA2, for instance to determine relatedness. The shortcut of just checking whether all letters are present and the beginning and end are OK, without caring about the sequence in the middle, doesn't give us the edit distance; it hides exactly the answer to the question we asked to begin with.
        • by durrr ( 1316311 )

          DNA mutations are more common in certain places (don't ask why; it's probably related to the geometry of packaged DNA and some other factors). An algorithm could spot-check these places first to give preliminary answers, but it would still have to run to completion to get a true answer.

      • This is one of many diffs (ahem) between DNA sequencing and natural language processing. Another is the alphabet size: DNA has an alphabet size of 4, while the alphabets (number of phonemes or graphemes) of natural languages range from a low of 11 (Rotokas and Mura) to a high of perhaps 140-something (!Xu, although the exact number depends on how you count things). Of course written Chinese and Japanese have much higher numbers of graphemes.

        It's also the case that some writing systems don't mark word boundaries

        • When I was thinking of how it could apply, the spacing is the "on/off switch" between genes and the individual genes are the "words". It wasn't overly well thought out; I was just thinking of how one could use known patterns to shorten processing time.

    • So don't compare only 2 DNA strings at once, or else compromise and accept good but imperfect results. Every time you discover a matching sequence, you can leverage the information you have gained from earlier comparisons. DASH: Searching Compressed DNA [amazon.com] (disclosure: a friend's PhD thesis).

  • This reminds me... (Score:5, Interesting)

    by Dutch Gun ( 899105 ) on Wednesday September 30, 2015 @05:03AM (#50625575)

    This article reminded me of learning about string-matching algorithms as a student. When you first naively implement one yourself (sans prior research), you'd generally start with what I'd call the string-matching equivalent of a bubble sort - it works, but it's hopelessly inefficient, as you begin a fresh search at each new character in the source text, backtracking if the match fails, and checking the next potential match.

    When you try to get a bit more clever, you realize there's probably a way to ensure that you examine each character in the searched text once and only once, never having to backtrack at all, and you get something like the Knuth–Morris–Pratt algorithm.

    However, it takes a fairly impressive leap of intuition / invention to get to the Boyer–Moore string search algorithm, which actually allows you to skip over checking some characters entirely - something I would have intuitively thought impossible without making that mental leap (a rough sketch of the skip idea follows this comment). Learning about these impressive algorithms was one of my favorite parts of the Computer Science curriculum when I was in school.

    It will be interesting to see if someone eventually breaks through the current state-of-the-art limits of string comparison in our lifetime. It would be a bit sad if the hypothetical maximum efficiency were already reached, as was predicted by these mathematical models, because the current best algorithms still require a lot of computation time. I've long since devoted myself to more practical topics as a working programmer, but it's fun to peek into what's happening at the bleeding edge of pure computer science research.
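
    For anyone curious what the character-skipping trick looks like, here is a bare-bones Boyer-Moore-Horspool search, a simplified relative of Boyer-Moore. The skip table and the example text are purely illustrative.

        # Boyer-Moore-Horspool: precompute how far the search window may
        # slide based on the text character under the window's last slot.
        def horspool_search(text: str, pattern: str) -> int:
            """Return the index of the first occurrence of pattern, or -1."""
            m, n = len(pattern), len(text)
            if m == 0:
                return 0
            # Characters absent from the pattern allow a full-length jump.
            shift = {c: m - 1 - i for i, c in enumerate(pattern[:-1])}
            pos = 0
            while pos + m <= n:
                if text[pos:pos + m] == pattern:
                    return pos
                pos += shift.get(text[pos + m - 1], m)
            return -1

        print(horspool_search("the quick brown fox", "brown"))  # 10

    When the character under the window's last slot does not occur in the pattern at all, the whole window jumps past it, which is the "skipping characters entirely" effect.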

    • by Anonymous Coward

      However, it takes a fairly impressive leap of intuition / invention to get to the Boyer–Moore string search algorithm

      Then you would think I'm a fucking "genius". When I was 10 years old I implemented Jump Lists for my string search algorithm. Not because my 33 MHz system was too slow for the naive implementation, but because I was tired of waiting for QBASIC to stop thrashing memory. I was implementing my own native x86 compiler using the only interface I had available (QBASIC). A C compiler cost 1/3rd of what a PC did back then. So, without prior training in CS I imagined how one might transform a text program into m

      • by Zeroko ( 880939 )

        So I was not the only one to do that repeated-string-mapping method of compilation (although I did mine in high school (using C for bootstrapping, rather than QBASIC, IIRC)...earlier I was too busy playing with Visual Basic, which had its own compiler)...interesting.

        I never tried to make my compiler optimizing, despite having a plan in my head to do so, because the binary for the compiler itself already fit in the 64KB that a .COM file allows, & I never did anything interesting enough with it for time to

    • Yes, a nice reminder of string processing problems. The problem they worked on isn't exactly string matching. They are trying to find the minimum "edit distance" between 2 strings. There are a lot of very similar seeming problems in this area.

      If all one is trying to do is test whether 2 strings match exactly, first check whether the lengths are equal. If they are, then it might be better to compare characters at the end first, under the idea that similar strings are more likely to match at the beginning.

      • Yeah, I realize they're not the same problem... I was just reminiscing a bit about a slightly different problem that I spent some time thinking about as a student.

        Another string-related problem I had fun figuring out (essentially re-inventing a variation of the Levenshtein distance formula) was a closest-match algorithm, say, for common search strings that are slightly misspelled, or as a spell-checker finding a closest potential match. A decent solution to this problem isn't terribly hard, but does require
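
        As an aside, Python's standard library ships a rough version of that closest-match idea: difflib scores candidates by a similarity ratio rather than true Levenshtein distance, but it covers the misspelled-query use case.

            import difflib

            dictionary = ["complement", "compliment", "committee", "commitment"]
            # Returns the dictionary words most similar to the misspelled query.
            print(difflib.get_close_matches("complemint", dictionary, n=2, cutoff=0.6))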

      • "first check whether the lengths are equal": I suppose if the string length is marked at the beginning of the string, that makes sense. But if not (e.g. the end of the string is marked by a null byte), doesn't that just slow things down? Because you have to traverse the strings twice: once to measure their length, and once to check for identity. I suppose it depends on the constant factor.

        "it might be better to compare characters at the end first, under the idea that similar strings are more likely to ma

        • A terminating null byte is a horrible way to denote the end of a string, as C/C++ designers eventually realized. The cost in time and space to maintain a separate value for the length of a string is O(1) no matter what operation is done. Not only does it take O(n) to find the terminating null, there's the additional complication of how to embed a null in the middle of a string without terminating it. Use an escape sequence? Just don't allow the terminating character in the string? No modern string manipulation

  • Inaccurate Summary (Score:3, Interesting)

    by Anonymous Coward on Wednesday September 30, 2015 @05:04AM (#50625577)

    First, having worked on software for DNA sequence alignment years ago, I don't think anyone thought they could do better than Smith-Waterman (a minor DNA-centric extension of a DP algorithm to find Levenshtein distance). Rather, everyone was using clever implementation techniques (SIMD instructions, GPUs, FPGAs) and heuristics to limit the amount of brute-force aligning. Oh, and lots and lots of hardware.

    Second, the summary author thinks that a dynamic programming algorithm isn't clever? They're either quite arrogant or don't know what they're talking about. Quicksort must be downright stupid, eh? Something for the short bus kids to play with.
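
    For readers who have not met it, a bare-bones Smith-Waterman scoring pass looks like the sketch below. The match/mismatch/gap values are arbitrary illustrative choices; real tools add traceback, affine gaps, and the SIMD/GPU/FPGA tricks mentioned above.

        # Smith-Waterman local alignment score (scoring only, no traceback).
        # H[i][j] is the best score of a local alignment ending at a[i-1], b[j-1].
        def smith_waterman(a: str, b: str, match=2, mismatch=-1, gap=-2) -> int:
            H = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
            best = 0
            for i in range(1, len(a) + 1):
                for j in range(1, len(b) + 1):
                    s = match if a[i - 1] == b[j - 1] else mismatch
                    H[i][j] = max(0,                    # start a fresh local alignment
                                  H[i - 1][j - 1] + s,  # align a[i-1] with b[j-1]
                                  H[i - 1][j] + gap,    # gap in b
                                  H[i][j - 1] + gap)    # gap in a
                    best = max(best, H[i][j])
            return best

        print(smith_waterman("GATTACA", "GCATGCU"))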

  • The paper gives a proof by contradiction: an edit-distance algorithm that is substantially faster would violate the strong exponential time hypothesis (SETH). However, this hypothesis itself is still unproven. Not that I hold any authority as to the validity of it all, though :)
  • The result is truly about theoretical limits of computation, not about practical limits. For example, I will happily accept an algorithm that runs in c*2^(a*N) time when a is 1e-100 and c is not big, as for practical purposes it runs in constant time. It is a fascinating topic in its own right why we do not have such algorithms for practical tasks.
  • Perhaps the only way to avoid the mathematically imposed restrictions on conventional computing machinery would be to (literally) use a different mechanism. One that is inspired by the nature of the problem.

    For example, consider the problem of ordering a collection of random-length pieces of uncooked spaghetti. They can be ordered in O(1) time by merely grasping them in your hand and striking them against a table top. Ordering 500 strands takes no more time than 50 (ignoring preparation time). Scaling up is

    • The mechanism isn't different, and your spaghetti-based computer is still stuck with the same limitations as anything else.

      What you're describing is a special purpose instruction for ordering a bunch of random length bits of spaghetti in parallel, and then throwing more and more hardware at the problem. In fact, the amount of hardware scales as O(N) with the amount of spaghetti.

      This isn't actually a new mechanism at all: complexity classes also deal with parallelising decision problems. For example NC comple

      • by jdagius ( 589920 )

        > While those methods can no doubt solve certain problems much more quickly than
        > conventional serial or even conventional parallel computers, there are certain problems that
        > they simply can't help with.

        The price paid for surpassing "optimal" general solutions of a problem is that they won't work on all instances of the problem. Quite often that is an acceptable price. It's a trade-off.

      • "In fact, the amount of hardware scales as O(N) with the amount of spaghetti."

        No, because if you double each side in a cube, you get an eight-fold increase in volume. 2 x the material to extend the cube gives you 8 x the spaghetti you can fit in it.

        • No, because if you double each side in a cube, you get an eight-fold increase in volume.

          The area of the table you need is roughly equal to the area of the stacked bunch of spaghetti. Since each strand is an approximately fixed diameter, they're about the same area each.

          If you double the number of strands, then you need a flat surface of twice the area. So, the hardware requirements are O(N).

          If you want to include the *length* of spaghetti instead of just the number of strands then it gets more complex. Howe

          • Why wouldn't you need length? You're sorting on length. You need something to hold the spaghetti strands in an upright position, so you need volume. Volume increases faster than the material needed to bound it does.

            http://www.tiem.utk.edu/~gross... [utk.edu]

            Surface area = 6 * length^2

            Volume = length^3

            The more hardware you add to make the cube bigger, the disproportionately more spaghetti you can sort. The hardware grows more slowly than O(n).

            Even if other hardware components increase at O(n) (not sure of that even), because your cont

            • In your O(n) which n are you talking about?

              The GP seemed to be referring to the number of strands. The minimum increase is O(n) with the number of strands.

              If you're talking about volume, it gets rather trickier. A 5-mile-high stack of spaghetti won't stand up under its own weight. The trouble is that the mass increases with the cube too.

    • Your method doesn't "sort" the spaghetti. It aligns their ends against the same reference ( the table top ), which has nothing to do with sorting. They will still be, in your hand, in haphazard "order", i.e. there will be no order at all, as the shortest may very well end up immediately next to the longest.
      • by jdagius ( 589920 )

        > It aligns their ends against the same reference ( the table top ),
        > which has nothing to do with sorting.

        But in the process of alignment, a linear ordering is imposed in one dimension, along the axis of the spaghetti rods, such that if you traverse this axis towards the table top, you are guaranteed access to the members of an ordered collection, tallest member first. Quite often that is an acceptable solution, especially if you are only interested in the tallest strands. Perhaps not a general solution

        • I AM an engineer, thank you. A linear ordering along one dimension is called a "partial ordering", which is often quite useless, even for an engineer :-/
          • by jdagius ( 589920 )

            No. It is a _total_ ordering over the relation "<=" in the sense that all pairs of elements in the linear order are comparable, i.e. elements a and b must satisfy "a <= b" or "b <= a", which is the requirement for "totality".

            In a _partial_ ordering (such as a collection of tree-structured elements) not all elements are necessarily comparable using "<=", e.g. elements in different paths.

            In plainer English: the strands of a spaghetti-sorted collection are totally ordered by length when traversing along the strand axis towards the table top.

        • by tepples ( 727027 )

          Alignment is O(n) in the length of rods. Querying by eye is O(n^.5) in the number of rods, and querying by touch is O(n) in the length of rods.

      • To sort spaghetti by size, doing it by hand sounds very inefficient.

        A sorting method where the spaghetti slides down a plane with slats that increase in size until the entire length can fall into a slot, much like a method for sorting coins, sounds like a way to solve both the length and order problem. Of course that will never scale "faster" than the number of sorting elements you use.

    • Wait: finding the single longest strand is O(1). Sorting them is O(n), because you have to keep picking out the longest strand.

  • by JoshuaZ ( 1134087 ) on Wednesday September 30, 2015 @07:12AM (#50626005) Homepage
    SETH is a very big assumption, much stronger than even assuming P != NP. Essentially, the exponential time hypothesis (ETH) says that any algorithm which solves 3-SAT https://en.wikipedia.org/wiki/3-SAT [wikipedia.org] must have genuinely exponential worst-case running time. In principle one could imagine a whole series of algorithms a_i, each solving 3-SAT in time O(x_i^n), with x_i going to 1 as i goes to infinity; ETH is the hypothesis that this doesn't happen for 3-SAT. SETH goes further and says that for k-SAT the best possible bases approach 2 as k grows, so no single epsilon > 0 lets you solve k-SAT for every k in time (2 - epsilon)^n. Many people are not convinced that ETH is true, although it certainly looks plausible, and SETH is a stronger bet still. That said, most people who have thought about this consider the possibility that ETH holds while SETH fails to be a deeply weird and generally unlikely option.
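
    For reference, the usual textbook formulations of the two hypotheses (standard phrasing, not quoted from the comment):

        \text{ETH: } \exists\, \delta > 0 \text{ such that no algorithm solves 3-SAT on } n \text{ variables in time } O\!\left(2^{\delta n}\right).
        \text{SETH: } \forall\, \varepsilon > 0\ \exists\, k \text{ such that no algorithm solves } k\text{-SAT in time } O\!\left(2^{(1-\varepsilon) n}\right).
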
  • Strings T ( the "text" ) and P ( the "pattern" ) both use an alphabet A of N symbols.
    Index each of A's symbols with a distinct odd prime number p1...pN. [ only odd primes as I personally don't like 2, the "oddest prime of them all" ;-)  ]
    String T has length M.
    Raise each prime index in T to the power of its location <= M in T, multiply them all, yielding a number sigT ( for signature of T ). [ begin counting locations at 1, not 0; if you start at 0 you'll lose the information about T's first symbol, since p^0 = 1... ]

    Do the same thing for P, yielding a number sigP.

    If sigT mod sigP yields zero, T contains P.

    Example: T = "ABRE". A -> 3, B-> 5, R->67 (the 18th odd prime), E->13
    so sigT = 3^1 * 5^2 * 67^3 * 13^4 = 644256903225

    P = "BR". B->5, R->67
    so sigP = 5^1 * 67^2 = 22445

    644256903225 mod 22445 = 0, so T contains P.

    Computational complexity: depends on the complexity of the 2 multiplication steps. sigT will by definition be the largest number ( humongous, actually, for genome sequences ). So you'll need a very efficient multiplication algorithm; the best one I know is Schönhage-Strassen, with complexity O(n log n log log n) for multiplying n-bit numbers.

    Just slurps enormous amounts of memory :-)

    • Maybe I'm missing something.

      ABRE -> A^1 * B^2 * R^3 * E^4
      ABER -> A^1 * B^2 * E^3 * R^4 = A^1 * B * B * E^3 * R^2 * R^2 = A^1 * B * E^3 * R^2 * (B^1 * R^2)

      So your calculation says that ABER contains BR

      And assuming I've done the calculation correctly using your numbers for A B R E, I get ABER = 3320400962775, BR=22445.

      3320400962775 / 22445 = 147934995 = A * B * E^3 * R^2
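
      A quick sketch that reproduces the arithmetic above, using the parent comments' prime assignments; it confirms the false positive.

          # Prime assignments from the grandparent comment: A->3, B->5, E->13, R->67.
          PRIME = {"A": 3, "B": 5, "E": 13, "R": 67}

          def signature(s: str) -> int:
              sig = 1
              for pos, sym in enumerate(s, start=1):  # positions start at 1
                  sig *= PRIME[sym] ** pos
              return sig

          sig_t = signature("ABER")  # 3320400962775
          sig_p = signature("BR")    # 22445
          print(sig_t % sig_p)       # 0, yet "BR" is not a substring of "ABER"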

      • You are absolutely right. I just coded the algorithm in a programming language and noticed the same thing. It needs some extra encoding for symbol successor / predecessor relationships. I'll think about that while walking the dog and post an update here ASAP.
  • If we define the genome being searched as the pool and the string being searched for as the key, the paper could be completely correct when there is no reuse of the pool or keys. If you search the same pool for multiple different keys, I can think of ways to pre-process the pool such that the first search plus pre-processing time still obeys the performance constraints their paper outlines, but subsequent searches of the same pool for a different key could happen in O(log n) time, or even constant time.
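
    One possible shape of such pre-processing is a plain k-mer index over the pool, sketched below. The helper names and the choice of k are illustrative, not taken from the comment.

        from collections import defaultdict

        def build_kmer_index(pool: str, k: int = 8) -> dict:
            """One-time pre-processing: map every length-k substring to its positions."""
            index = defaultdict(list)
            for i in range(len(pool) - k + 1):
                index[pool[i:i + k]].append(i)
            return index

        def find(pool: str, index: dict, key: str, k: int = 8) -> list:
            """Later searches: look up the key's first k letters, then verify in place."""
            return [i for i in index.get(key[:k], []) if pool.startswith(key, i)]

        pool = "ACGTACGTTTGACCTAGGACGTACGTTTGA"
        idx = build_kmer_index(pool, k=8)
        print(find(pool, idx, "ACGTTTGA", k=8))  # [4, 22]

    The index costs O(n) time and space to build once; each later lookup is a hash probe plus verification of the candidate positions.
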
  • Fascinating article. This stuck out:

    If Williams or anyone else can prove the existence of an edit-distance algorithm that runs even moderately faster than normal, SETH is history.

    The edit distance between book and back is 2 ... in Latin characters. It's 1 with kanji. The edit distance is probably several hundred if you're comparing the two on a per-pixel basis instead of per letter. Edit distance varies by scope.

    Surely I'm misunderstanding something. It can't be that simple.

    • For a given algorithm, what is its worst-case running time across all possible inputs for that algorithm (of which there must be an infinite number)?

      The answer the algorithm gives doesn't matter. It's how long you might have to wait for that answer that is important.

      • (of which there must be an infinite number)

        I didn't put this very well. What I mean is that there must be larger and larger cases.

        For the question "what is the edit distance between two n-character strings taken from alphabet L?", n can grow without bound. If there is a limit on n then, mathematically at least, the problem can be solved in O(1) via lookup.

        We often need the general-case algorithm even when our inputs are finite. For example, there may be a fundamental limit to how many "characters" there are i

  • Often one finds there are domain-specific shortcuts. For example, if we know that "useful" DNA sequences tend to be X letters long, then an algorithm optimized for matching sequences of length X can perhaps be devised rather than trying to match everything against everything.

    Sometimes it's better for the organization to search, say, 10,000 objects with an imperfect search tool rather than 500 with a perfect one. The most promising ones can then later be gone over with the fuller algorithm.

    I'm not a biology expert, but I

  • by CODiNE ( 27417 )

    To someone who hasn't taken calculus yet, finding inflection points and the zeros of a curve may seem like a brute-force problem. Even finding the area of an oddly shaped object would seem to require an enormously repetitive process of slicing and adding.

    Yet once you know the tricks, it becomes trivial.

    It may be impossible but maybe not.

  • You can have one computer serialize and hash the data (map-reduce style), then a second computer (or thread) does whatever you have to do. As long as the alphabet is a closed set smaller than the current int size (32/64 bits), it's a good strategy for a full match.
  • SETH is true. I know two of them.
