Tracing the Limits of Computation
An anonymous reader writes: For more than 40 years, researchers had been trying to find a better way to compare two arbitrary strings of characters, such as the long strings of chemical letters within DNA molecules. The most widely used algorithm is slow and not all that clever: It proceeds step-by-step down the two lists, comparing values at each step. If a better method to calculate this "edit distance" could be found, researchers would be able to quickly compare full genomes or large data sets, and computer scientists would have a powerful new tool with which they could attempt to solve additional problems in the field.
Yet in a paper presented at the ACM Symposium on Theory of Computing, two researchers from the Massachusetts Institute of Technology put forth a mathematical proof that the current best algorithm was "optimal" — in other words, that finding a more efficient way to compute edit distance was mathematically impossible. But researchers aren't quite ready to record the time of death. One significant loophole remains. The impossibility result is only true if another, famously unproven statement called the strong exponential time hypothesis (SETH) is also true.
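For readers who want to see what that step-by-step procedure looks like, here is a minimal sketch of the textbook quadratic-time dynamic program for edit distance (Wagner–Fischer). This is presumably, though not certainly, the algorithm the summary alludes to; the function and variable names are mine.

```python
def edit_distance(x, y):
    """Minimum number of insertions, deletions and substitutions turning x into y."""
    prev = list(range(len(y) + 1))                # row 0: distance from "" to y[:j]
    for i, cx in enumerate(x, start=1):
        curr = [i]                                # column 0: delete all i characters of x[:i]
        for j, cy in enumerate(y, start=1):
            cost = 0 if cx == cy else 1
            curr.append(min(prev[j] + 1,          # delete cx
                            curr[j - 1] + 1,      # insert cy
                            prev[j - 1] + cost))  # keep or substitute
        prev = curr
    return prev[-1]

print(edit_distance("book", "back"))   # 2, the example that comes up later in the thread
```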
The algorithm isn't clever, but scales well. (Score:4, Insightful)
Comparing strings is a case for specialized hardware, not for more sophisticated algorithms.
Re:The algorithm isn't clever, but scales well. (Score:4, Informative)
It's not just about changing a symbol, it also includes adding and deleting symbols. That makes simple parallelization impossible.
It's not clear if the mathematical proof cited in the article is the simple one of only changing symbols, or the more complex case of symbol addition and deletion. From the article, it appears that they are talking about the simple case, which makes the more complex algorithm even more intractable.
Re: (Score:1)
Definition from TFA: For any two sequences x, y, EDIT(x, y) is equal to the minimum, over all sequences z, of the number of deletions and substitutions needed to transform x into z and y into z. This is equivalent to the one you cited: insertion into x is the same thing as deletion from y.
Abstract it away with change tracking (Score:2)
The formal description of the edit distance [wikipedia.org] is different from that described in the article. This is the one used in some cases of DNA or text comparison.
It's not just about changing a symbol, it also includes adding and deleting symbols. That makes simple parallelization impossible.
It's not clear if the mathematical proof cited in the article is the simple one of only changing symbols, or the more complex case of symbol addition and deletion. From the article, it appears that they are talking about the simple case, which makes the more complex algorithm even more intractable.
That does make the simple hardware a little trickier, although for specific purposes there are some tricks you may be able to design in practice that become more efficient as the size of the insert grows. (An entire gene, for example, would be easier to handle per gene than a single base pair.)
Basically you can make a list of the changes. If you know you inserted a big object at place X, you can dynamically generate any requested index without having to rewrite the whole really long string, for example.
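A rough sketch of that change-tracking idea, assuming insertions are simply recorded as (position, text) pairs against the original string; the TrackedString class and its methods are my own illustration, not an established API:

```python
class TrackedString:
    """Resolve reads against an original string plus a list of recorded insertions."""

    def __init__(self, original):
        self.original = original
        self.edits = []                        # (insert_position, text), kept sorted by position

    def insert(self, position, text):
        """Record an insertion at a position in the original string, without rewriting it."""
        self.edits.append((position, text))
        self.edits.sort()

    def char_at(self, index):
        """Dynamically map a logical index to the original string or an inserted chunk."""
        offset = 0
        for pos, text in self.edits:
            if index < pos + offset:
                break                          # an original character before this insertion
            if index < pos + offset + len(text):
                return text[index - pos - offset]
            offset += len(text)
        return self.original[index - offset]

s = TrackedString("AAAATTTT")
s.insert(4, "GGGG")                               # "insert a whole gene" without copying the string
print("".join(s.char_at(i) for i in range(12)))   # AAAAGGGGTTTT
```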
Re: (Score:1)
No, not really. It is easily parallelized, yes. But you don't improve big-O speed the way you do with sorting algorithms.
String compare will always be O(n).
Now, based on the data you may be able to speed it up: say your string contains particular patterns you can group into a single letter, such as compressing the string and comparing the two compressed versions; or, say with DNA, you compress the data to 4 bits per value so you are searching more compactly. However, you are still going at linear speed.
Th
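As a minimal sketch of the compression idea above: the comment mentions 4 bits per value, but 2 bits already cover a plain A/C/G/T alphabet, which is what this illustration (encoding and names are my own) uses. The comparison is still linear, just over a more compact representation.

```python
BASE_CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}

def pack(seq):
    """Pack a DNA string into a single integer, 2 bits per base."""
    packed = 0
    for base in seq:
        packed = (packed << 2) | BASE_CODE[base]
    return packed

def packed_equal(a, b):
    """Exact comparison on the packed form; still O(n) overall, just fewer machine words to compare."""
    return len(a) == len(b) and pack(a) == pack(b)

print(packed_equal("ACGTACGT", "ACGTACGT"))  # True
print(packed_equal("ACGTACGT", "ACGAACGT"))  # False
```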
Re: (Score:1)
The human brain is known to take shortcuts when doing string comparisons, so why can't a computer, to a certain degree? If memory serves (and I apologize if I'm off, it's the end of a long day), when reading English at least, the brain reads the first letter and the last letter and approximates the length of the word to make the connection to the word stored in the brain. This apparently works up to a character limit, then the brain chunks the words, so compliment and complement might get confused as cplt+length for b
Re: (Score:1)
Read and understand the problem and your answer will become trivial. The problem you present in your post has nothing to do with the string comparison algorithm in general (or the sequence alignment problem in particular).
Re: (Score:3)
Re: (Score:2)
DNA mutations are more common in certain places (don't ask why; it's probably related to the geometry of packaged DNA and some other factors). An algorithm could spot-check these places first to give preliminary answers, but it would still have to run to completion to get a true answer.
Re: (Score:1)
This is one of many diffs (ahem) between DNA sequencing and natural language processing. Another is the alphabet size: DNA has an alphabet size of 4, while the alphabets (number of phonemes or graphemes) of natural languages range from a low of 11 (Rotokas and Mura) to a high of perhaps 140-something (!Xu, although the exact number depends on how you count things). Of course written Chinese and Japanese have much higher numbers of graphemes.
It's also the case that some writing systems don't mark word boundaries.
Re: (Score:2)
When I was thinking of how it could apply, the spacing would be the "on/off switch" between genes and the individual genes would be the "words". It wasn't overly well thought out; I was just thinking of how one could use known patterns to shorten processing time.
Re: The algorithm isn't clever, but scales well. (Score:1)
Thanks for those hot assurances without having actually run the experiments yourself.
Re: (Score:2)
So don't compare only 2 DNA strings at once, or you can compromise and get good imperfect results. Every time you discover a matching sequence, you can leverage the information you have gained from earlier comparisons. DASH: Searching Compressed DNA [amazon.com] (disclosure: a friend's PhD thesis).
This reminds me... (Score:5, Interesting)
This article reminded me of learning about string-matching algorithms as a student. When you first naively implement one yourself (sans prior research), you generally start with what I'd call the string-matching equivalent of a bubble sort: it works, but it's hopelessly inefficient, since you begin a fresh comparison at each new character in the source text, backtracking whenever the match fails and checking the next potential starting position.
When you try to get a bit more clever, you realize there's probably a way to ensure that you examine each character in the searched text once and only once, never having to backtrack at all, and you get something like the Knuth–Morris–Pratt algorithm.
However, it takes a fairly impressive leap of intuition / invention to get to the Boyer–Moore string search algorithm, which actually allows you to skip over checking some characters entirely, something I would have intuitively thought impossible without making that mental leap. Learning about these impressive algorithms was one of my favorite parts of the Computer Science curriculum when I was in school.
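For the curious, here is a minimal sketch of that character-skipping behaviour. It uses the simplified Boyer–Moore–Horspool variant rather than full Boyer–Moore (my substitution, purely to keep the example short), and the function name is mine:

```python
def horspool_search(text, pattern):
    """Return the index of the first occurrence of pattern in text, or -1."""
    m, n = len(pattern), len(text)
    if m == 0:
        return 0
    # Bad-character table: how far to shift when the last character of the window mismatches.
    shift = {c: m - 1 - j for j, c in enumerate(pattern[:-1])}
    i = 0
    while i + m <= n:
        if text[i:i + m] == pattern:
            return i
        i += shift.get(text[i + m - 1], m)   # often skips several characters at once
    return -1

print(horspool_search("here is a simple example", "example"))   # 17
```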
It will be interesting to see if someone eventually breaks through the current state-of-the-art limits of string comparison in our lifetime. It would be a bit sad if the hypothetical maximum efficiency were already reached, as was predicted by these mathematical models, because the current best algorithms still require a lot of computation time. I've long since devoted myself to more practical topics as a working programmer, but it's fun to peek into what's happening at the bleeding edge of pure computer science research.
Re: (Score:1)
However, it takes a fairly impressive leap of intuition / invention to get to the Boyer–Moore string search algorithm
Then you would think I'm a fucking "genius". When I was 10 years old I implemented Jump Lists for my string search algorithm. Not because my 33 MHz system was too slow for the naive implementation, but because I was tired of waiting for QBASIC to stop thrashing memory. I was implementing my own native x86 compiler using the only interface I had available (QBASIC). A C compiler cost 1/3rd of what a PC did back then. So, without prior training in CS, I imagined how one might transform a text program into machine code
Re: (Score:1)
So I was not the only one to do that repeated-string-mapping method of compilation (although I did mine in high school (using C for bootstrapping, rather than QBASIC, IIRC)...earlier I was too busy playing with Visual Basic, which had its own compiler)...interesting.
I never tried to make my compiler optimizing, despite having a plan in my head to do so, because the binary for the compiler itself already fit in the 64KB that a .COM file allows, & I never did anything interesting enough with it for time to
edit distance, not just matching (Score:2)
Yes, a nice reminder of string processing problems. The problem they worked on isn't exactly string matching. They are trying to find the minimum "edit distance" between 2 strings. There are a lot of very similar seeming problems in this area.
If all one is trying to do is test whether 2 strings match exactly, first check whether the lengths are equal. If they are, then it might be better to compare characters at the end first, under the idea that similar strings are more likely to match at the beginning.
Re: (Score:2)
Yeah, I realize they're not the same problem... I was just reminiscing a bit about a slightly different problem that I spent some time thinking about as a student.
Another string-related problem I had fun figuring out (essentially re-inventing a variation of the Levenshtein distance formula) was a closest-match algorithm, say, for common search strings that are slightly misspelled, or as a spell-checker finding a closest potential match. A decent solution to this problem isn't terribly hard, but does requir
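A small sketch of the closest-match idea. It leans on Python's standard-library difflib, which ranks candidates by a similarity ratio rather than true Levenshtein distance, so it's a stand-in for the approach described above rather than the same algorithm; the word list is made up.

```python
import difflib

dictionary = ["complement", "compliment", "complete", "competent"]

# Suggest the closest dictionary entries for a misspelled query.
print(difflib.get_close_matches("complemant", dictionary, n=2, cutoff=0.6))
# e.g. ['complement', 'compliment']
```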
Re: (Score:1)
"first check whether the lengths are equal": I suppose if the string length is marked at the beginning of the string, that makes sense. But if not (e.g. the end of the string is marked by a null byte), doesn't that just slow things down? Because you have to traverse the strings twice: once to measure their length, and once to check for identity. I suppose it depends on the constant factor.
"it might be better to compare characters at the end first, under the idea that similar strings are more likely to ma
Re: (Score:2)
A terminating null byte is a horrible way to denote the end of a string, as C/C++ designers eventually realized. The cost and space to maintain a separate value for the length of a string is O(1) no matter what operation is done. Not only does it take O(n) to find the terminating null, there's the additional complication of how to embed a null in the middle of a string, without terminating it. Use an escape sequence? Just don't allow the terminating character in the string? No modern string manipulatio
Inaccurate Summary (Score:3, Interesting)
First, having worked on software for DNA sequence alignment years ago, I don't think anyone thought they could do better than Smith-Waterman (a minor DNA-centric extension of a DP algorithm to find Levenshtein distance). Rather, everyone was using clever implementation techniques (SIMD instructions, GPUs, FPGAs) and heuristics to limit the amount of brute-force aligning. Oh, and lots and lots of hardware.
Second, the summary author thinks that a dynamic programming algorithm isn't clever? They're either quite arrogant or don't know what they're talking about. Quicksort must be downright stupid, eh? Something for the short bus kids to play with.
Strong Exponential Time Hypothesis is unproven (Score:1)
not a practical limit (Score:1)
Spaghetti sort (Score:1)
Perhaps the only way to avoid the mathematically imposed restrictions on conventional computing machinery would be to (literally) use a different mechanism. One that is inspired by the nature of the problem.
For example, consider the problem of ordering a collection of random-length pieces of uncooked spaghetti. They can be ordered in O(1) time by merely grasping them in your hand and striking them against a table top. Ordering 500 strands takes no more time than 50 (ignoring preparation time). Scaling up is
Re: (Score:1)
I'm also pretty sure that designing and building any kind of computer can't be done in O(1) time. I'm calling that "preparation time" (which you were instructed to ignore).
Re: (Score:2)
The mechanism isn't different and your spaghetti based computer is still stuck with the same limitations as anything else.
What you're describing is a special purpose instruction for ordering a bunch of random length bits of spaghetti in parallel, and then throwing more and more hardware at the problem. In fact, the amount of hardware scales as O(N) with the amount of spaghetti.
This isn't actually a new mechanism at all: complexity classes also deal with parallelising decision problems. For example NC comple
Re: (Score:1)
> While those methods can no doubt solve certain problems much more quickly than
> conventional serial or even conventional parallel computers, there are certain problems that
> they simply can't help with.
The price paid for surpassing "optimal" general solutions of a problem is that they won't work on all instances of the problem. Quite often that is an acceptable price. It's a trade-off.
Re: (Score:1)
"In fact, the amount of hardware scales as O(N) with the amount of spaghetti."
No, because if you double each side in a cube, you get an eight-fold increase in volume. 2 x the material to extend the cube gives you 8 x the spaghetti you can fit in it.
Re: (Score:2)
No, because if you double each side in a cube, you get an eight-fold increase in volume.
The area of the table you need is roughly equal to the area of the stacked bunch of spaghetti. Since each strand is an approximately fixed diameter, they're about the same area each.
If you double the number of strands, then you need a flat surface of twice the area. So, the hardware requirements are O(N).
If you want to include the *length* of spaghetti instead of just the number of strands then it gets more complex. Howe
Re: (Score:1)
Why wouldn't you need length? You're sorting on length. You need something to hold the spaghetti strands in an upright position, so you need volume. Volume increases faster than the material needed to bound it does.
http://www.tiem.utk.edu/~gross... [utk.edu]
Surface area = 6 * length^2
Volume = length^3
The more hardware you add to make the cube bigger, the much more spaghetti you can sort. The hardware increases less than O(n).
Even if other hardware components increase at O(n) (not sure of that even), because your cont
Re: (Score:2)
In your O(n) which n are you talking about?
The GP seemed to be referring to the number of strands. The minimum increase is O(n) with the number of strands.
If you're talking about volume, it gets rather trickier. A 5-mile-high stack of spaghetti won't stand up under its own weight. The trouble is that the mass increases with the cube too.
Re: (Score:2)
Re: (Score:1)
> It aligns their ends against the same reference ( the table top ),
> which has nothing to do with sorting.
But in the process of alignment, a linear ordering is imposed in one dimension, along the axis of the spaghetti rods, such that if you traverse this axis towards the table top, you are guaranteed access to the members of an ordered collection, tallest member first. Quite often that is an acceptable solution, especially if you are only interested in the tallest strands. Perhaps not a general solut
Re: (Score:2)
Re: (Score:1)
No. It is a _total_ ordering over the relation "<=" in the sense that all pairs of elements in the linear order are comparable, i.e. elements a and b must satisfy "a <= b" or "b <= a", which is the requirement for totality.
In a _partial_ ordering (such as a collection of tree-structured elements) not all elements are necessarily comparable using "<=", e.g. elements on different paths.
In plainer English: the strands of a spaghetti-sorted collection are totally ordered by length when traversing along the strand axis toward the table top.
Re: (Score:2)
Alignment is O(n) in the length of rods. Querying by eye is O(n^.5) in the number of rods, and querying by touch is O(n) in the length of rods.
Re: (Score:2)
To sort spaghetti by size, doing it by hand sounds very inefficient.
A sorting method where the strands slide down a plane with slats that increase in size until the entire length can fall into a slot, much like a method for sorting coins, sounds like a way to solve both the length and ordering problems. Of course that will never scale "faster" than the number of sorting elements you use.
Re: (Score:1)
Wait: finding the single longest strand is O(1). Sorting them is O(n), because you have to keep picking out the longest strand.
SETH is a pretty big assumption (Score:3)
Re: (Score:1)
Prove it and you'll be famous.
(Or you can write a note in the margin of a book on algorithmic complexity saying that the margin is too small to contain your proof.)
Here is an idea to play with: Gödel notation (Score:3)
Index each of the alphabet A's symbols with a distinct odd prime number p1...pN. [ Only odd primes, as I personally don't like 2, "the oddest prime of them all". ]
String T has length M.
Raise each prime index in T to the power of its location (from 1 to M) in T, and multiply them all together, yielding a number sigT ( for "signature of T" ). [ Begin counting locations with 1, not with 0; otherwise you'll lose the information about T's first symbol... ]
Do the same thing for the pattern P, yielding a number sigP.
if sigT mod sigP yields zero, T contains P.
Example: T = "ABRE". A -> 3, B-> 5, R->67 (the 18th odd prime), E->13
so sigT = 3^1 * 5^2 * 67^3 * 13^4 = 644256903225
P = "BR". B->5, R->67
so sigP = 5^1 * 67^2 = 22445
644256903225 mod 22445 = 0, so T contains P.
Computational complexity: depends on the complexity of the two multiplication steps. sigT will by definition be the larger number ( humongous, actually, for genome sequences ). So you'll need a very efficient multiplication algorithm; the best one I know is Schönhage–Strassen, with complexity O(n log n log log n).
Just slurps enormous amounts of memory
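A minimal sketch of the signature scheme as described, reproducing the worked example above. The prime assignment (n-th letter of the alphabet maps to the n-th odd prime) and the helper names are mine:

```python
def odd_primes(count):
    """Return the first `count` odd primes: 3, 5, 7, 11, ..."""
    primes, candidate = [], 3
    while len(primes) < count:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 2
    return primes

LETTER_PRIME = dict(zip("ABCDEFGHIJKLMNOPQRSTUVWXYZ", odd_primes(26)))

def signature(s):
    """Product of prime(symbol) ** position, with positions counted from 1."""
    sig = 1
    for position, symbol in enumerate(s, start=1):
        sig *= LETTER_PRIME[symbol] ** position
    return sig

print(signature("ABRE"))                     # 644256903225
print(signature("BR"))                       # 22445
print(signature("ABRE") % signature("BR"))   # 0, so "ABRE contains BR" by this test
```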
Re: (Score:2)
Maybe I'm missing something.
ABRE -> A^1 * B^2 * R^3 * E^4
ABER -> A^1 * B^2 * E^3 * R^4 = A^1 * B * B * E^3 * R^2 * R^2 = A^1 * B * E^3 * R^2 * (B^1 * R^2)
So your calculation says that ABER contains BR
And assuming I've done the calculation correctly using your numbers for A B R E, I get ABER = 3320400962775, BR=22445.
3320400962775/22445 = 147934995 = BRE*A
Re: (Score:2)
Re: (Score:2)
How many times are you searching? (Score:2)
I wonder... (Score:2)
The edit distance between book and back is 2... in Latin characters. It's 1 with kanji. The edit distance is probably several hundred if you're comparing the two on a per-pixel basis instead of per letter. Edit distance varies by scope.
Surely I'm misunderstanding something. It can't be that simple.
Re: (Score:2)
For a given algorithm, what is its worst-case running time across all possible inputs for that algorithm (of which there must be an infinite number)
The answer the algorithm gives doesn't matter. It's how long you might have to wait for that answer that is important.
Re: (Score:2)
(of which there must be an infinite number)
I didn't put this very well. What I mean is that there must be larger and larger cases.
For the question "what is the edit distance between two n character strings taken from alphabet L" n can grow without bound. If there is a limit to n then, mathematically at least, the problem can be solved in O(1) via lookup.
We often need the general case algorithm even when our inputs are finite. For example, there may be a fundamental limit to how many "characters" there are i
Domain-specific shortcuts (Score:1)
Often one finds there are domain-specific shortcuts. For example, if we know that "useful" DNA sequences tend to be X letters long, then perhaps an algorithm optimized for matching sequences of length X can be devised, rather than trying to match everything against everything.
Sometimes it's better for the organization to search, say, 10,000 objects with an imperfect search tool rather than 500 with a perfect one. The most promising candidates can then be gone over later with the fuller algorithm.
I'm not a biology expert, but I
Maybe (Score:2)
To someone who hasn't taken calculus yet, finding the inflection points and the zero points of a curve may seem like a brute-force problem. Even the area of an oddly shaped object would be an enormously repetitive process of slicing and adding.
Yet once you know the tricks, it becomes trivial.
It may be impossible but maybe not.
anything more complex than O(n) (Score:1)
SETH exists (Score:2)
SETH is true. I know two of them.