Researchers Expanding Diff, Grep Unix Tools
itwbennett writes "At the Usenix Large Installation System Administration (LISA) conference being held this week in Boston, two Dartmouth computer scientists presented variants of the grep and diff Unix command line utilities that can handle more complex types of data. The new programs, called Context-Free Grep and Hierarchical Diff, will provide the ability to parse blocks of data rather than single lines. The research has been funded in part by Google and the U.S. Energy Department."
Strange names (Score:4, Funny)
Re:Strange names (Score:4, Insightful)
I expect those are just the spoken names and that the commands will still be single words, similar to:
"GNU awk" -> gawk
"enhanced grep" -> egrep
Re: (Score:2)
"enhanced grep" -> egrep
Well, except that egrep is already taken :)
But yeah, your point is valid and probably correct.
Re: (Score:3)
and I really should spend a few more seconds thinking about what I'm responding to. Obviously gawk and egrep are existing tools, given as examples, not proposed names for these new tools.
Re:Strange names (Score:4, Informative)
and I really should spend a few more seconds thinking about what I'm responding to
That's not what Slashdot is about........
Re: (Score:3)
But of course, "eegrep" isn't :)
(enhanced enhanced grep)
Re:Strange names (Score:5, Funny)
Next thing you know we'll have CSIgrep. (enhance enhance enhance grep)
Re:Strange names (Score:5, Funny)
Re: (Score:2)
yes, but it is nice to know that all of your expectations for the first 26 minutes are incorrect.
Re: (Score:2)
Re: (Score:2)
No, those 30 minutes are per bit.
But you'd be surprised by the amount of information you can gather from a single bit!
Re: (Score:2)
If you use a TV instead of a monitor, science and computer stuff runs really, really fast.
Re: (Score:2)
What if you pipe the results through /dev/tivo?
Re: (Score:2)
Just wait until Microsoft sees your post and we'll have eeegrep.
Re:Strange names (Score:4, Funny)
Just wait until Microsoft sees your post and we'll have eeegrep.
No, I expect they'd call it grep#. And when Apple forks their own version, it'll be objective grep.
They should call it... (Score:4, Insightful)
perl. Isn't this exactly why perl was invented?
Re: (Score:2)
Re: (Score:2)
Yes, also sed, and awk.
They are still ages behind Prolog, which will parse context-dependent texts....
the perl man page (Score:3)
From the header of the 1988 perl man page:
Submitted-by: Larry Wall
Posting-number: Volume 13, Issue 1
Archive-name: perl/part01
[ Perl is kind of designed to make awk and sed semi-obsolete. This posting
will include the first 10 patches after the main source. The following
description is lifted from Larry's manpage. --r$ ]
Perl is a interpreted language optimized for scanning arbitrary text
files, extracting information from t
Subject line is not part of the comment (Score:2)
They should call it... perl. Isn't this exactly why perl was invented?
Perl could do this - with the right libraries. But that's the real value they're adding here. They created tools that operate on files with knowledge of the structure of those files. So for instance a "diff" between two XML files with identical contents but differences in formatting could report that the files are identical... Or if you had some file structure that defined a directed-graph structure, a format meant to be edited in-place (and which therefore might sometimes have holes in it where data wa
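The formatting-insensitive XML diff described above can be crudely approximated today by normalizing whitespace before diffing. A sketch (filenames invented; real XML canonicalization is far more careful than this):

```shell
# Two XML files with identical content but different formatting.
printf '<a><b>x</b></a>\n' > one.xml
printf '<a>\n  <b>x</b>\n</a>\n' > two.xml

# A plain diff reports textual differences...
diff one.xml two.xml || echo "files differ as text"

# ...but stripping insignificant whitespace first shows they match.
tr -d ' \t\n' < one.xml > one.norm
tr -d ' \t\n' < two.xml > two.norm
diff one.norm two.norm && echo "structurally identical"
```

A real hierarchical diff would of course distinguish significant whitespace (say, inside text nodes) from indentation, which this one-liner cannot.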
Re: (Score:3)
Yay, a tools thread!
I am liking meld (python-based visual diff)
But I suppose they have a different concept of hierarchical diff than diffing/merging two directory structures.
Re: (Score:2)
If the FS supports spaces in filenames, then you have broken code if you can't tolerate it. MS wisely put a space in the "Program Files" name when they added long filenames to Windows. That'll put any delusions about being able to ignore it to a direct immediate stop.
Re:Strange names (Score:4, Interesting)
But having to use quotes every time you call a command is a sure way to make sure your command is never used.
Would you rather type this:
./"Context-Free Grep" ...
or this:
./cfgrep ...
Re:Strange names (Score:5, Insightful)
If you don't like a tool's name, export an alias.
It's not about typing commands as much as it's about making these work:
Versus these:
A lot of scripts you run into are just broken because of braindead assumptions.
Re: (Score:2)
touché, I've gotten it right before but not this time!
Re: (Score:3)
in scripts, i pretty much quote everything. seems to be the way to avoid problems. of course, i'm not a sysadmin by trade, so maybe it's bad for some reason or something.
when at the prompt i hit tab.
We'd probably avoid a lot of problems if people weren't too lazy to type a few extra characters.
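To illustrate the point about quoting (the filename here is invented):

```shell
# Create a file whose name contains a space.
f="my file.txt"
touch "$f"

# Unquoted, $f undergoes word splitting: ls receives two arguments,
# "my" and "file.txt", neither of which exists.
ls $f 2>/dev/null || echo "unquoted: lookup failed"

# Quoted, ls receives exactly one argument: the real filename.
ls "$f"
```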
Re: (Score:2)
./C makes that ok, but that's not the problem. The problem is that you lose one level of quotation.
Re: (Score:2)
Just because there is a way to make it work doesn't mean there isn't a problem with it. All unix shells can handle spaces in filenames, but the methods to do so are not always intuitive, and it's easy to mess up things like shell scripts. Even the "proper" solutions have problems.
And I can't stand "Program Files", what a mess that has been.
Re: (Score:2)
Re: (Score:2)
Re:Strange names (Score:5, Informative)
Re:Strange names (Score:5, Insightful)
I mean, where would we end up if unix commands actually gave a hint about what they are doing
As a unix novice, if I wanted to search for something, my first choice of course would be grep
Also if I wanted help on something, the first word that jumps to my mind would be man
heh.
Comment removed (Score:5, Funny)
Re: (Score:2, Funny)
Bonus points if the command is an inscrutable acronym that refers to itself.
Re: (Score:3)
Re: (Score:2)
Also if I wanted help on something, the first word that jumps to my mind would be man
If you want help, perhaps you should read the MMMAAANNNual.
hint.
Re: (Score:2)
Definitely. I mean, where would we end up if unix commands actually gave a hint about what they are doing ;-)
As a unix novice, if I wanted to search for something, my first choice of course would be grep
Also if I wanted help on something, the first word that jumps to my mind would be man
heh.
It's a reasonable assumption that unix was designed specifically to be counter-intuitive.
Re:Strange names (Score:5, Funny)
You have to figure in two's complement notation. If it's sufficiently counter-intuitive, the sign bit flips over and it becomes totally intuitive.
Re:Strange names (Score:5, Funny)
Unix is user-friendly, it's just picky about who its friends are.
Re: (Score:3)
People were using terminals that were as slow as 110 baud. No one wanted to type extra characters.
Re:Strange names (Score:4, Interesting)
Alas, history and lots of shell scripts have probably made existing command names unchangeable. History in this case goes back to the time people got RSI from ASR-33 Teletypes and didn't want to have to type very much, and names that make sense only if you know other programs (in ed, "g/<regular expression>/p" prints all lines containing the specified regular expression, hence the name "grep").
That said, we programmers are users of programming languages as much as Joe Sixpack is a user of the desktop, and surely we deserve good design as much as they do, so we can get things done rather than taking perverse pride in mastering needlessly ghastly syntax.
Re: (Score:2)
Shame on me for typing literal greater than and less than. That should have been "g/<regular expression>/p".
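The idiom survives almost verbatim in sed, ed's stream-oriented descendant, which makes the etymology easy to demonstrate (sample file invented):

```shell
printf 'alpha\nbeta\nalpha beta\n' > sample.txt

# ed's g/re/p command -- "globally find the regular expression and
# print" -- lives on as sed's /re/p:
sed -n '/alpha/p' sample.txt

# which is exactly what the standalone utility does:
grep alpha sample.txt
```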
Re:Strange names (Score:5, Insightful)
As to yours though.. I wouldn't want spaces in my commands. How do you tell where the command ends and the arguments begin?
As for man... man is the MANual. That's not that bad is it? Ok, help might be a little better but it's not a big deal unless you are very closed minded. It's really a history thing. Man wasn't just somebody's idea of a help command. Unix (or Unics as it was called back then) originally actually had a manual. As in dead trees paper! It got big. Real big. One day Dennis Ritchie accidentally dropped a copy and killed his dog. Flattened the poor girl like a pancake. After that he decided it needed to be digital. Man is a digital copy of that original dog killing book plus decades of additions and updates. Thus it is man(ual).
Now should manual have been "manual" or maybe the real whole title "Unix Programmers Manual"? It might be easier to remember. 5 years after you learned that command, and you are still typing it 5 times a day, would you still appreciate the ease of using real whole English words? Are you that abc? (abbreviationally challenged) Or do you just really love typing? Is your r/l name Mavis Beacon?
That's how a lot of Unix commands are; they make plenty of sense with history. I'm sure grep and the others all have their own stories. Well... not all. How much of a story does it take to see that ls is a lazy way to type list? Oh, yah, you are AbC. Sorry about that.
Yes, the history of decades old programming decisions isn't really something you want to learn to use an OS (or any other software). But what's the alternative? Throw everything out x number of years and start over? It sounds great when you are a hopeless newbie but once you actually learn something do you want to do it all over again every 10 years just to make it easy for the next batch of basement kiddies? Your clock is ticking too you know! Now get off my lwn!!!! (lawn)
P.S. Ok, Ok, I made up the dog part of the story. But it COULD have happened! The rest was real. Actually, I don't KNOW that it didn't happen... hmm....
Re:Strange names (Score:5, Insightful)
Very true. Unix programmers seem to follow these rules:
So these tools will likely be run as "ctxtfrgrp" and "hierdiff" or just "cfgrep" and "hdiff"
Re: (Score:2)
Like 'cat' for concatenate, or vi for what exactly?
"vi" is short of "visual".
First there was "ed", the, you guessed it, "editor". But "ed" was a real pain to use, because you wouldn't see what you were actually editing (if you ever used ed, you'd know what I mean). So the "visual" editor "vi" was invented.
Re: (Score:2)
"Catenate" is actually a word and means the same thing as "concatenate". Unfortunately, 1 - epsilon of people associate "cat" with F. domesticus, so "cat" was a really lousy choice.
Re: (Score:3)
The distinguished artist sees "cat" as an excellent choice—a palette for the creative file-namer, a mad-lib left incomplete!
At least, that's how I justify log files named dog and crap.
Re: (Score:3)
Make it a shell built-in and chide the user if only a single input was used [partmaps.org] (e.g. cat file | grep blah).
How's it compare to Meld? (Score:2)
A nice GUI diff for Linux. (Has 3-way).
Click here to install [deb]
Re: (Score:3, Insightful)
It is surprising that Slashdot even let you post a deb: url, as the filter usually seems to destroy most non-http(s) links. However, not everyone uses a Debian-based distro, and not everyone tries some random package (even from the repository) before reading a little about it, so posting the home page [sourceforge.net] would have been a bit more useful.
Re: (Score:3)
Yeah, I usually post a disclaimer ("for Debian/Ubuntu/Mint" -- now "Debian/Mint/Ubuntu").
Second, yes, /. does allow that, and I hope they continue to do so, because deb:// and click to install is neat and handy (even a lot of old Linux hands don't even know about it).
Finally, (as you mentioned) it's not a link to download software, but rather install software from the repositories, so there's that level of security.
Re: (Score:2)
Or ASCII GUI: vimdiff ... works fine, also with 3 files ...
Re: (Score:2)
awk? (Score:2)
Done! It's called "awk". Just set the RS and FS fields as appropriate. :P
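For example, setting RS to the empty string puts awk into paragraph mode, where each blank-line-separated block is one record, so a match returns the whole block (sample data invented):

```shell
printf 'name: foo\nrole: web\n\nname: bar\nrole: db\n' > hosts.txt

# With RS='' each blank-line-separated block is a single record,
# so the pattern match prints the whole matching block.
awk -v RS='' '/role: db/' hosts.txt
```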
Perl (Score:3)
Follow the money...? (Score:2, Interesting)
I wonder what's the interest of these two in this.
-dZ.
RTFA? (Score:5, Informative)
FTFA:
Google's interest in this technology springs from the company's efforts in cloud computing, where it must automate operations across a wide range of networking gear, Weaver said. The DOE foresees that this sort of software could play a vital role in smart grids, in which millions of energy consuming end-devices would have connectivity of some sort. The software would help "make sense of all the log files and the configurations of the power control networks," Weaver said.
Re: (Score:2)
I wonder why people feel the need to "sign" their posts, when their username is quite clearly visible at the top.
-nk
Re:Follow the money...? (Score:5, Insightful)
Context-free grep/diff can be used to search for data/changes in arbitrary non-line-record-based files. Such as XML, HTML, JSON, SQLite databases, other databases, Apache configs, and many other pieces of data. Heck, even most programming languages are not line-based, but statement terminated/separated. Imagine being able to grep for a function name, and getting its entire prototype/usage even when it spans multiple lines (very common in standard glibc headers). And, depending on the plugin's capabilities, you could grep for a function name as a function name and not get back any usage of that text as a variable or embedded in a string, or a comment (skip commented-out calls!).
If there's sufficient configurability, you could ask for the entire block that given text is in, and such a grep would be able to display everything in the corresponding {...}. Makes grep that much more valuable.
So, my question is, why aren't more IT-heavy corporations/government departments involved?
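Until a structure-aware grep exists, the multi-line prototype case can be faked with sed's range addressing (header contents invented; this kludge breaks as soon as the start and end patterns aren't unique):

```shell
cat > sample.h <<'EOF'
int add(int a, int b);
int frobnicate(int a,
               int b,
               int c);
EOF

# grep shows only the first line of the multi-line prototype...
grep frobnicate sample.h

# ...while a sed range from the name to the closing ");" prints all of it.
sed -n '/frobnicate(/,/);/p' sample.h
```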
Re: (Score:2)
Why does that necessitate screwing around with grep? I can sort of see modifying diff, but with grep if you need that data you'd write a new program to parse it and pipe it.
Re: (Score:3)
So weird. I spent the last 6 months writing some Java libraries that do exactly this. There were some similar things out there, but they weren't licensed appropriately for my uses, or were WAY too expensive. Writing a hierarchical diff engine is the most complex thing I've ever done; hell, writing an efficient plain diff engine is insane in itself. You have to identify blocks/structure. Then you have to diff the structures, then you have to diff the content in the structures. Once all of that is said and done th
Re: (Score:3)
LOL and that my friend is the hard part. It cost me $4000 in legal fees to make sure they are not owned by the company I work for, and 6 weeks of work. I'm leaning towards an AGPL/open core model. I just see so many people NOT happy with open core stuff. Also, I didn't get a grant from Google or the D.O.E. And these are just small, yet integral, parts of a larger system. That I don't really want to give away yet. Hell, deciding on licensing is harder than coding sometimes. Gotta feed the family you know,
Re:Follow the money...? (Score:4, Interesting)
I wouldn't call it a cancer. But it's definitely useful if you don't ever want commercial companies to use your code in public. It matches up well with the open core model. Commercial people will only use it if you can give them a differently licensed copy of the code. Apache, MIT, and BSD are great if you truly want to give your code away and don't care what people do with it behind closed doors. AGPL is nice to make sure people always give back. LGPL and GPL are nice if you only want them to give back if they change it. Should people pay, and how much, is an age-old question. I have to balance the cost of support and development vs. the cost of the product. The more I lean on the community, the less I can charge and the more exposure I get. While in the other direction I get more money, but have to spend more of it. And there is no one-size-fits-all solution to any of this.
Re: (Score:2)
A New Year's resolution it is, then...
Re: (Score:3)
Vast amounts of open-source software have been funded by the government. BSD was developed by UC Berkeley, which is largely funded by Pentagon contracts.
And the Internet.
Meanwhile, the vast majority of open source projects never get past the opening statement.
You clearly don't know what it takes to accomplish a project like this one. What have you ever done, that gives you some standing to announce that this Usenix project is a load of crap?
Re: (Score:2)
I'm no doctor but I can tell that wound is infected.
Interesting... (Score:3, Interesting)
No download link? (Score:2)
Link to one of their papers on these tools (Score:5, Informative)
Almost vaporware (Score:2)
sgrep (Score:2)
Object grep (Score:2)
I'd like a grep tool that could scan XML data for instances of objects (according to some XSD or DTD), and take object state values as search arguments.
If it could scan objects in memory I'd love that better, but XML seems the only likely candidate for a format that a universal tool would parse.
Re: (Score:3)
XML is ok, but there are many data formats that could really use a diff/grep utility that could make sense of them. HDF5 and NetCDF are nice in the scientific community, for example. Computer graphics geeks might find intelligent diff/grep tools for the Renderman format to be useful. Office users might want to know if two documents are genuinely different or were compressed differently. Hell, it would be incredibly useful if they could diff a MS Office file and LibreOffice file in their native formats to se
Ooooh! (Score:4, Interesting)
As soon as I see "Context-Free Grep", I immediately think of a Context Free Grammar [wikipedia.org].
That basically implies we can have much more sophisticated rules that match other structural elements the way a language compiler does. Which means that in theory you could do greps that take into account structures a little more complex than just a flat file.
Grep and diff that can be made aware of the larger structure of documents potentially has a lot of uses. Anybody who has had to maintain structured config files across multiple environments has likely wished for this before.
Sounds really cool.
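A taste of what this means in practice: regular expressions cannot count nesting, but a small awk state machine can, as in this sketch that extracts the first brace-delimited block however deeply it nests (config format invented; a real CFG-aware grep would also handle braces inside strings and comments):

```shell
cat > conf.txt <<'EOF'
server {
  name a
  limits {
    cpu 2
  }
}
server {
  name b
}
EOF

# Print the first "server" block in full, tracking brace depth so the
# nested "limits" block doesn't end the match early.
awk '/^server/ { inside = 1 }
     inside {
       print
       depth += gsub(/[{]/, "{")   # gsub returns how many braces it saw
       depth -= gsub(/[}]/, "}")
       if (depth == 0) exit
     }' conf.txt
```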
Re: (Score:2)
It will be interesting to see what they come up with. From the paper posted above it looks like it will definitely be taking "wisdom" about certain file types, but I hope they also work on some fuzzy guessing modes as well that do not require prior knowledge of the language being parsed.
The main potential for ick factor is whether they can manage to get a set of commandline flags that can be used/learned incrementally so you don't have to memorize a ream of flags just to get something useful done, and can
Re: (Score:2)
One of the reasons context-free searches aren't more prevalent in everyday computing is the increase in complexity, and thus the computational resources needed to process them. If matching a regular grammar is linear, then context-free is closer to linearithmic (n*log(n)).
Regular expressions can handle multi-line searches as easily as they do single-line searches (some people above were saying how the multi-line aspect of this would be really useful). The line delimiter is merely a convenience.
What they can't handle are nested searches.
Microsoft Ad (Score:4, Interesting)
I know I'll be modded down, but I have to say it: what they describe is already available in Powershell, where objects can be piped into search/filter functions.
Re: (Score:2, Funny)
That's great for all fifty people who use Powershell.
Re: (Score:2)
That's great for all fifty people who use Powershell.
You might need to get a better "grep" on reality, it's not 2004 anymore.
Oh, to suffer the slings and arrows... (Score:2)
I know I'll be modded down
Dude, the only part of your post that I find objectionable is this assumption that you're going to be crucified for posting your thoughts. I know that there are some people on Slashdot who are pretty predictably triggered to shout down certain opinions - just don't assume that everyone here is like that, OK?
I think there's a lot to like about Powershell, and part of me will always be a bit jealous that Windows got a shell with those kinds of capabilities before Linux did. It does indeed seem that what the
Re: (Score:2)
I think there's a lot to like about Powershell, and part of me will always be a bit jealous that Windows got a shell with those kinds of capabilities before Linux did. It does indeed seem that what they describe bgrep and bdiff doing could be accomplished in Powershell. I've never been too clear on some of the particulars of how that would be done, though. As I understand it, you can search/filter either XML data streams, or a sequence of .NET objects. Would the way to accomplish this in .NET, then, be to have a commandlet that opens the source file and passes its contents through as .NET objects? It would be a bit less compact than having the special type handling right in the "find" or "filter" command, but it does lend a certain clarity to things, too...
Powershell and .Net can collaborate both ways. As an example, many recent Microsoft products use .Net for the GUI, but in the backend all the actual work is done in Powershell. The opposite is also true: one can plug a .Net component into a Powershell script, for example to do a custom filter.
The best example for the PS pipe model is with the VMWare extensions (PowerCLI). You can get a full inventory by writing a script like this:
Get-VMHost | Get-VM | Export-Csv c:\myinventory.csv
In the CSV you get all the p
Re: (Score:2)
I know I'll be modded down, but I have to say it: what they describe is already available in Powershell, where objects can be piped in search/filter functions.
Sure, and it's been possible in Perl (for example) forever:
use File::Slurp;

my $multi_line_pattern = join("", @ARGV);
my $text = read_file('filename');
if ($text =~ /$multi_line_pattern/) {
    # do something useful.
}
The only problem with the above is that it fails in anything other than trivial situations.
The issue isn't passing things through filters, it's doing so in a way that you don't have to write insanely devious and complex filters. This grep tool is still only at the design stage, so I'll
Powershell envy (Score:2)
So what? Maybe people want a non-proprietary solution that works on more than one OS.
If there are such people, and it's not just me, I'd love to oblige them. :) I really need to get crackin'...
darcs (Score:2)
There's a version-control system called darcs (written by the son of a colleague of mine) that incorporates some interesting ideas along these lines. For example, say you have a program with 100,000 lines of code, and there's a function in it called Foo, which is called thousands and thousands of times. You want to change the name of the function to Bar. In a traditional diff-based system, this results in thousands of differences. Darcs is supposed to be able to handle changes like this and recognize that i
existing tools and suggestion (Score:2)
PCRE has recursive patterns (available as pcregrep) and .NET has balancing groups, also allowing grep-like operations involving context-free grammars. For XML data, there are various XML query languages that allow wonderfully complex queries over XML structures. There are also refactoring tools that allow syntax-aware searches across source files.
For diff, the situation is a bit more complicated. There are XML-based diff tools, programming language syntax aware diff tools, and complex edit distance based
Re: (Score:2)
There are XML-based diff tools, programming language syntax aware diff tools, and complex edit distance based diff tools already. It seems difficult to come up with something more generic. Let's say you want to diff programming language source files in languages for which there is no diff tool. What good is a context-free diff tool going to be? You'd need to specify the entire grammar for the language.
I don't know if you didn't read this bit, or if I've misunderstood your post... But the basic (proposed?) approach of bgrep and bdiff is to provide a plug-in mechanism that would be used to extend the tools to new languages and data types. So, yes, you would have to specify the entire grammar for the language, but you'd only do it once... Or preferably, someone would have already done it for you. :)
Re: (Score:2)
The use cases, options, and interfaces are different for searching programming language source files, XML files, and other text. So, you really need at least three tools: bgrep-lang, bgrep-xml, and bgrep-text. Each of those might then have a plugin mechanism. But these three classes of tools already exist. Trying to force them into a single command line tool makes little sense to me.
"bgrep-text" is just pcregrep.
"bgrep-xml" is any one of a number of XML query and search tools, using XQuery or similar la
Terrible idea (Score:5, Insightful)
This violates so many rules of the Unix philosophy [wikipedia.org] that I don't even know where to begin...
FTFA:
Grep has issues with data blocks as well. "With regular expressions, you don't really have the ability to extract things that are nested arbitrarily deep," Weaver said.
If your data structures are so complex that diff/grep won't cut it, they should probably be massaged into XML, in which case you can use XSLT [wikipedia.org] off the shelf. It's already customizable to whatever data format you're working with.
FTFA:
With [operational data in block-like data structures], a tool such as diff "can be too low-level," Weaver said. "Diff doesn't really pay attention to the structure of the language you are trying to tell differences between." He has seen cases where diff reports that 10 changes have been made to a file, when in fact only two changes have been made, and the remaining data has simply been shifted around.
No, 10 changes have been made. The fact that only two substantive changes have been made based on 10 edits is a subjective determination. That is, unless you want to detect that moving a block of code or data from one place to another in a file has no actual effect, in which case good luck because that's a domain-specific hard problem.
Re: (Score:2)
no big deal; these features can become a FLAG option on grep and diff and I wouldn't care; it could be useful. I just use perl for most stuff and grep for only really simple stuff. Ok, I'd probably still use perl... it's my Swiss Army pocket knife I use to hammer everything ;-)
10 changes have been made - Disagree (Score:2)
Suppose you signal the nesting level by indentation, as most programmers today do.
If you add a condition around some code, then for example 3 lines might be indented, resulting in 5 lines being altered instead of the 2 that actually changed.
For this, the proposed improved grep and diff might be good, at least better than the current state of diff. Okay, maybe I should mention the -b flag, but the -b flag might be a problem if you code in Whitespace or so ;-)
http://en.wikipedia.org/wiki/Whitespace_(programming_language) [wikipedia.org]
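The scenario is easy to reproduce, and diff -w (ignore all whitespace) already gets the "two real changes" answer, at the cost of also ignoring edits that matter in whitespace-significant code (file contents invented):

```shell
cat > old.c <<'EOF'
if (x)
    do_a();
do_b();
do_c();
EOF
cat > new.c <<'EOF'
if (x)
    do_a();
if (y) {
    do_b();
    do_c();
}
EOF

# Plain diff flags the re-indented lines as changed: 6 lines touched.
diff old.c new.c || true

# diff -w ignores whitespace and reports only the 2 genuinely new lines.
diff -w old.c new.c || true
```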
Re: (Score:3)
This violates so many rules of the Unix philosophy [wikipedia.org] that I don't even know where to begin...
I'll take this on. It's a subject that is of particular interest to me.
First of all, you have to consider whether it even matters that a tool violates "rules" of the "Unix philosophy". I mean, seriously, why assume that some system design ideas cooked up 30-40 years ago are necessarily the One True Path? Because "those who do not understand Unix are doomed to reinvent it poorly"? What if the designers in question do understand Unix? Or what if <gasp> they might actually have some ideas that surpa
Structural Regular Expressions (Score:2)
TXR! (Score:2)
I'm working on a text processing tool that deals with blocks of data, and it's already here.
http://www.nongnu.org/txr [nongnu.org]
Re: (Score:2)
Well if we can use our computers more efficiently then we'll save energy. On the other hand I can't imagine what use the DOD would have for this, especially since they seem to run Windows at every opportunity...
Re: (Score:2)
I think maybe some of the scientist types at the DOE were behind the funding.
Re: (Score:3, Interesting)
Check out the wikipedia article for supercomputers [wikipedia.org], and you'll see DOE mentioned.
Tools like this could help with analysis and finding certain data sets. IIRC, regexes are already used in DNA sequencing. There is probably a similar application and use for tools like this with their data.
Re: (Score:3)
Re: (Score:3)
This would work, but better. No, I'm not being flippant.
If you have structured data (say XML), you could target hierarchies like config-root:server-name:name. That way if the text inside "name" is only being looked for in that one field, you won't hit a bunch of other stuff that also happen to be similar strings but are unrelated.
I'm sure you'd still have your regular grep/diff utilities, but there's definitely places where being able to match
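A crude stand-in for that kind of hierarchy-aware match today is an awk state machine that only looks for <name> while inside a <server> element (XML layout invented; a real structured search would use an actual parser rather than assuming one tag per line):

```shell
cat > config.xml <<'EOF'
<config>
  <owner>
    <name>alice</name>
  </owner>
  <server>
    <name>web01</name>
  </server>
</config>
EOF

# Plain grep can't tell the two <name> elements apart...
grep -c '<name>' config.xml

# ...but tracking the enclosing element restricts the match to the server.
awk '/<server>/   { in_server = 1 }
     /<\/server>/ { in_server = 0 }
     in_server && /<name>/' config.xml
```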
Re: (Score:2)
Are you sure?
They can probably do it remotely on most OSes anyway. Quick - make friends with Theo.
Perl to wa chigau no da yo! Perl to wa! (Score:2)
Why do we need to write another perl?
Is it really "writing another perl"? The meat of these tools (which, I think, aren't yet implemented?) is that they filter and compare parsed data structures - and provide plug-in hooks so people can insert parsers for additional data types. Certainly this could be done as a Perl library - and doing so might have some advantages over creating new tools with their own plug-in mechanism. But implementing bgrep and bdiff is nowhere close to "writing another perl".