Researchers Expanding Diff, Grep Unix Tools

Posted by timothy on Thursday December 08, 2011 @02:44PM from the now-with-raisins dept.

itwbennett writes "At the Usenix Large Installation System Administration (LISA) conference being held this week in Boston, two Dartmouth computer scientists presented variants of the grep and diff Unix command line utilities that can handle more complex types of data. The new programs, called Context-Free Grep and Hierarchical Diff, will provide the ability to parse blocks of data rather than single lines. The research has been funded in part by Google and the U.S. Energy Department."

Strange names (Score:4, Funny)

by gnasher719 ( 869701 ) writes: on Thursday December 08, 2011 @02:46PM (#38305988)

Putting space characters in the name of a Unix command line tool is asking for trouble.
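
To make the trouble concrete, here is a minimal sketch of how a POSIX-style shell would split such a name into words. It uses Python's standard shlex module as a stand-in for the shell's word splitting, and it assumes the spoken name "Context-Free Grep" were used literally as the command name, which nothing in the story confirms.

```python
import shlex

# Unquoted: the space splits the tool name into two words, so a shell would
# look for a command called "Context-Free" and pass "Grep" as an argument.
unquoted = shlex.split("Context-Free Grep pattern file.txt")

# Quoted: the name survives as a single word.
quoted = shlex.split('"Context-Free Grep" pattern file.txt')

# shlex.quote shows the quoting a script would have to add to stay safe.
safe_name = shlex.quote("Context-Free Grep")

print(unquoted)
print(quoted)
print(safe_name)
```

Every script that builds a command line from such a name has to remember the quoting step, which is the practical argument for a single-word command name.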

- Re:Strange names (Score:4, Insightful)
  
  by realyendor ( 32515 ) writes: on Thursday December 08, 2011 @02:51PM (#38306056)
  
  I expect those are just the spoken names and that the commands will still be single words, similar to:
  "GNU awk" -> gawk
  "enhanced grep" -> egrep
  
  - Re: (Score:2)
    
    by dougmc ( 70836 ) writes:
    
    "enhanced grep" -> egrep
    Well, except that egrep is already taken :)
    But yeah, your point is valid and probably correct.
    - Re: (Score:3)
      
      by dougmc ( 70836 ) writes:
      
      and I really should spend a few more seconds thinking about what I'm responding to. Obviously gawk and egrep are existing tools, given as examples, not proposed names for these new tools.
      - Re:Strange names (Score:4, Informative)
        
        by EdIII ( 1114411 ) writes: on Thursday December 08, 2011 @03:36PM (#38306596)
        
        and I really should spend a few more seconds thinking about what I'm responding to
        That's not what Slashdot is about........
        
    - Re: (Score:3)
      
      by ivoras ( 455934 ) writes:
      
      But of course, "eegrep" isn't :)
      (enhanced enhanced grep)
      - Re:Strange names (Score:5, Funny)
        
        by ripler ( 19188 ) writes: on Thursday December 08, 2011 @03:24PM (#38306454)
        
        Next thing you know we'll have CSIgrep. (enhance enhance enhance grep)
        
        Re:Strange names (Score:5, Funny)
        
        by Anne Thwacks ( 531696 ) writes: on Thursday December 08, 2011 @03:42PM (#38306676)
        
        CSIgrep would take 30 mins to get the result! (With ad breaks)
        
        Re: (Score:2)
        
        by berashith ( 222128 ) writes:
        
        yes, but it is nice to know that all of your expectations for the first 26 minutes are incorrect.
        
        Re: (Score:2)
        
        by Tomato42 ( 2416694 ) writes:
        
        sign me in if it will search a 1TB data set in those 30min!
        
        Re: (Score:2)
        
        by marcosdumay ( 620877 ) writes:
        
        No, those 30min is per bit.
        But you'd be surprised by the amount of information you can gather from a single bit!
        
        Re: (Score:2)
        
        by bryan1945 ( 301828 ) writes:
        
        If you use a TV instead of a monitor, science and computer stuff runs really, really fast.
        
        Re: (Score:2)
        
        by jd ( 1658 ) writes:
        
        What if you pipe the results through /dev/tivo?
      - Re: (Score:2)
        
        by Noughmad ( 1044096 ) writes:
        
        Just wait until Microsoft sees your post and we'll have eeegrep.
        
        Re:Strange names (Score:4, Funny)
        
        by 93 Escort Wagon ( 326346 ) writes: on Thursday December 08, 2011 @08:31PM (#38310252)
        
        Just wait until Microsoft sees your post and we'll have eeegrep.
        No, I expect they'd call it grep#. And when Apple forks their own version, it'll be objective grep.
        
    - They should call it... (Score:4, Insightful)
      
      by goombah99 ( 560566 ) writes: on Thursday December 08, 2011 @03:50PM (#38306792)
      
      perl. Isn't this exactly why perl was invented?
      
      - Re: (Score:2)
        
        by The Askylist ( 2488908 ) writes:
        
        I've always thought perl should be renamed SOS - Self-Obfuscating Scripting. But then again I prefer languages to be human-readable.
      - Re: (Score:2)
        
        by marcosdumay ( 620877 ) writes:
        
        Yes, also sed, and awk.
        They are still ages behind prolog, that will parse context dependent texts....
      - the perl man page (Score:3)
        
        by goombah99 ( 560566 ) writes:
        
        From the header of 1988 perl man page:
        Submitted-by: Larry Wall
        Posting-number: Volume 13, Issue 1
        Archive-name: perl/part01
        [ Perl is kind of designed to make awk and sed semi-obsolete. This posting
        will include the first 10 patches after the main source. The following
        description is lifted from Larry's manpage. --r$ ]
        Perl is a interpreted language optimized for scanning arbitrary text
        files, extracting information from t
      - Subject line is not part of the comment (Score:2)
        
        by Tetsujin ( 103070 ) writes:
        
        They should call it... perl. Isn't this exactly why perl was invented?
        Perl could do this - with the right libraries. But that's the real value they're adding here. They created tools that operate on files with knowledge of the structure of those files. So for instance a "diff" between two XML files with identical contents but differences in formatting could report that the files are identical... Or if you had some file structure that defined a directed-graph structure, a format meant to be edited in-place (and which therefore might sometimes have holes in it where data wa
  - Re: (Score:3)
    
    by rwa2 ( 4391 ) * writes:
    
    Yay, a tools thread!
    I am liking meld (python-based visual diff)
    But I suppose they have a different concept of hierarchical diff than diffing/merging two directory structures.
- Re: (Score:2)
  
  by pclminion ( 145572 ) writes:
  
  If the FS supports spaces in filenames, then you have broken code if you can't tolerate it. MS wisely put a space in the "Program Files" name when they added long filenames to Windows. That'll put any delusions about being able to ignore it to a direct immediate stop.
  - Re:Strange names (Score:4, Interesting)
    
    by adonoman ( 624929 ) writes: on Thursday December 08, 2011 @03:12PM (#38306322)
    
    But having to use quotes every time you call a command is a sure way to make sure your command is never used.
    Would you rather type this:
    ./"Context-Free Grep" ...
    or this:
    ./cfgrep ..
    
    - Re:Strange names (Score:5, Insightful)
      
      by iluvcapra ( 782887 ) writes: on Thursday December 08, 2011 @03:48PM (#38306752)
      
      If you don't like a tool's name, export an alias.
      It's not about typing commands as much as it's about making these work:
      $ find . -name "*.txt" | xargs wc
      $ for file in $*; do mv $file old/$file; done
      
      Versus these:
      $ find . -name "*.txt" -print0 | xargs -0 wc
      $ for file in "$@"; do mv "$file" "old/$file"; done
      
      A lot of scripts you run into are just broken because of braindead assumptions.
      
      - Re: (Score:2)
        
        by iluvcapra ( 782887 ) writes:
        
        touché, I've gotten it right before but not this time!
    - Re: (Score:3)
      
      by gangien ( 151940 ) writes:
      
      in scripts, i pretty much quote everything. seems to be the way to avoid problems. of course, i'm not a sysadmin by trade, so maybe it's bad for some reason or something.
      when at the prompt i hit tab.
      We'd probably avoid a lot of problems if people weren't too lazy to type a few extra characters.
    - Re: (Score:2)
      
      by emj ( 15659 ) writes:
      
      ./C makes that ok, but that's not the problem. The problem is that you lose one level of quotation.
  - Re: (Score:2)
    
    by jandrese ( 485 ) writes:
    
    Ironically, many of Microsoft's tools have trouble dealing with the space in the filename, including the blasted Run window.
    
    Just because there is a way to make it work doesn't mean there isn't a problem with it. All unix shells can handle spaces in filenames, but the methods to do so are not always intuitive and it's easy to mess up things like shell scripts. Even the "proper" solutions have problems.
    
    And I can't stand "Program Files", what a mess that has been.
    - Re: (Score:2)
      
      by bzipitidoo ( 647217 ) writes:
      
      That's why I always create these 2 directories on Windows installations: "C:\Software" and "C:\Hardware". I change "Program Files" to "Software" in every installer that gives the user that option. Except driver software goes in Hardware. Quick way to sort out what I've installed from what something else installed. And it fits in 8 characters, in case that old limit is ever an issue.
    - - Re: (Score:2)
        
        by jandrese ( 485 ) writes:
        
        Bingo! You've discovered the basic problem with spaces in names, space is reserved as a delimiter and thus you're forced to quote anything you type that has a space in the name. It's the textbook example of the awkward workaround. If it were rare it wouldn't be a big deal (like on most Unix systems), but in Windows you end up having to use it all the damn time if you do any work at all on the commandline, even for simple operations. It's bad ergonomics.
- Re:Strange names (Score:5, Informative)
  
  by mytec ( 686565 ) writes: on Thursday December 08, 2011 @03:31PM (#38306540) Journal
  
  According to this paper [dartmouth.edu], they are called bgrep and bdiff.
  
- - Re:Strange names (Score:5, Insightful)
    
    by Longjmp ( 632577 ) writes: on Thursday December 08, 2011 @03:31PM (#38306526)
    
    Definitely
    I mean, where would we end up if unix commands actually give a hint what they are doing ;-)
    As a unix novice, if I wanted to search for something, my first choice of course would be grep
    Also if I wanted help on something, the first word that jumps to my mind would be man
    
    heh.
    
    - Comment removed (Score:5, Funny)
      
      by account_deleted ( 4530225 ) writes: on Thursday December 08, 2011 @03:53PM (#38306836)
      
      Comment removed based on user account deletion
      
      - Re: (Score:2, Funny)
        
        by TheSpoom ( 715771 ) writes:
        
        Bonus points if the command is an inscrutable acronym that refers to itself.
        
        Re: (Score:3)
        
        by Anomalyst ( 742352 ) writes:
        
        don't forget the 'n' prefix to indicate the previous flavour is deprecated.
    - Re: (Score:2)
      
      by serviscope_minor ( 664417 ) writes:
      
      Also if I wanted help on something, the first word that jumps to my mind would be man
      If you want help, perhaps you should read the MMMAAANNNual.
      hint.
    - Re: (Score:2)
      
      by kelemvor4 ( 1980226 ) writes:
      
      Definitely
      I mean, where would we end up if unix commands actually give a hint what they are doing ;-)
      As a unix novice, if I wanted to search for something, my first choice of course would be grep
      Also if I wanted help on something, the first word that jumps to my mind would be man
      heh.
      It's a reasonable assumption that unix was designed specifically to be counter intuitive.
      - Re:Strange names (Score:5, Funny)
        
        by jd ( 1658 ) writes: on Thursday December 08, 2011 @05:44PM (#38308496) Homepage Journal
        
        You have to figure in two's complement notation. If it's sufficiently counter-intuitive, the sign bit flips over and it becomes totally intuitive.
        
      - Re:Strange names (Score:5, Funny)
        
        by rk ( 6314 ) writes: on Thursday December 08, 2011 @08:51PM (#38310416) Journal
        
        Unix is user-friendly, it's just picky about who its friends are.
        
      - Re: (Score:3)
        
        by jbolden ( 176878 ) writes:
        
        People were using terminals that were as slow as 110 baud. No one wanted to type extra characters.
    - Re:Strange names (Score:4, Interesting)
      
      by jejones ( 115979 ) writes: on Thursday December 08, 2011 @06:02PM (#38308742) Journal
      
      Alas, history and lots of shell scripts have probably made existing command names unchangeable. History in this case goes back to the time people got RSI from ASR-33 Teletypes and didn't want to have to type very much, and names that make sense only if you know other programs (in ed, "g//p" prints all lines containing the specified regular expression, hence the name "grep").
      That said, we programmers are users of programming languages as much as Joe Sixpack is a user of the desktop, and surely we deserve good design as much as they do, so we can get things done rather than taking perverse pride in mastering needlessly ghastly syntax.
      
      - Re: (Score:2)
        
        by jejones ( 115979 ) writes:
        
        Shame on me for typing literal greater than and less than. That should have been "g/<regular expression>/p".
    - Re:Strange names (Score:5, Insightful)
      
      by morgauxo ( 974071 ) writes: on Thursday December 08, 2011 @06:07PM (#38308814)
      
      GP was a joke I am sure.
      
      As to yours though.. I wouldn't want spaces in my commands. How do you tell where the command ends and the arguments begin?
      
      As for man... man is the MANual. That's not that bad is it? Ok, help might be a little better but it's not a big deal unless you are very closed minded. It's really a history thing. Man wasn't just somebody's idea of a help command. Unix (or Unics as it was called back then) originally actually had a manual. As in dead trees paper! It got big. Real big. One day Dennis Ritchie accidentally dropped a copy and killed his dog. Flattened the poor girl like a pancake. After that he decided it needed to be digital. Man is a digital copy of that original dog killing book plus decades of additions and updates. Thus it is man(ual).
      
      Now should manual have been "manual" or maybe the real whole title "Unix Programmers Manual"? It might be easier to remember. 5 years after you learned that command and you are still typing it 5 times a day would you still appreciate the ease of using real whole English words? Are you that abc? (abreviationally challenged) Or do you just really love typing. Is your r/l name Mavis Beacon?
      
      That's how a lot of Unix commands are, they make plenty of sense with history. I'm sure grep and the others all have their own stories. Well.. not all. How much of a story does it take to come to ls is a lazy way to type list? Oh, yah, you are AbC. Sorry about that.
      
      Yes, the history of decades old programming decisions isn't really something you want to learn to use an OS (or any other software). But what's the alternative? Throw everything out every x number of years and start over? It sounds great when you are a hopeless newbie but once you actually learn something do you want to do it all over again every 10 years just to make it easy for the next batch of basement kiddies? Your clock is ticking too you know! Now get off my lwn!!!! (lawn)
      
      P.S. Ok, Ok, I made up the dog part of the story. But it COULD have happened! The rest was real. Actually, I don't KNOW that it didn't happen... hmm....
      
  - Re:Strange names (Score:5, Insightful)
    
    by mfnickster ( 182520 ) writes: on Thursday December 08, 2011 @04:36PM (#38307526)
    There's nothing that says the name of the tool and the command you type must be the same
    Very true. Unix programmers seem to follow these rules:
    
    delete any spaces in the name
    
    delete any vowels in the name
    
    delete any superfluous consonants
    
    chuck the entire thing and just abbreviate it to the first letter of each word in the name
    
    So these tools will likely be run as "ctxtfrgrp" and "hierdiff" or just "cfgrep" and "hdiff"
    - - Re: (Score:2)
        
        by urdak ( 457938 ) writes:
        
        Like 'cat' for concatenate, or vi for what exactly?
        "vi" is short for "visual".
        First there was "ed", the, you guessed it, "editor". But "ed" was a real pain to use, because you wouldn't see what you were actually editing (if you ever used ed, you'd know what I mean). So the "visual" editor "vi" was invented.
      - Re: (Score:2)
        
        by jejones ( 115979 ) writes:
        
        "Catenate" is actually a word and means the same thing as "concatenate". Unfortunately, 1 - epsilon of people associate "cat" with F. domesticus, so "cat" was a really lousy choice.
        
        Re: (Score:3)
        
        by smellotron ( 1039250 ) writes:
        
        "cat" was a really lousy choice.
        
        The distinguished artist sees "cat" as an excellent choice—a palette for the creative file-namer, a mad-lib left incomplete!
        At least, that's how I justify log files named dog and crap.
        
        Re: (Score:3)
        
        by smellotron ( 1039250 ) writes:
        
        How much can you improve a 100 line program that does nothing but concatenate streams?
        
        Make it a shell built-in and chide the user if only a single input was used [partmaps.org] (e.g. cat file | grep blah).
How's it compare to Meld? (Score:2)

by Compaqt ( 1758360 ) writes:

A nice GUI diff for Linux. (Has 3-way).
Click here to install [deb]
- Re: (Score:3, Insightful)
  
  by Anonymous Coward writes:
  
  It is surprising that Slashdot even let you post a deb: url, as the filter usually seems to destroy most non-http(s) links. However, not everyone uses a Debian-based distro, and not everyone tries some random package (even from the repository) before reading a little about it, so posting the home page [sourceforge.net] would have been a bit more useful.
  - Re: (Score:3)
    
    by Compaqt ( 1758360 ) writes:
    
    Yeah, I usually post a disclaimer ("for Debian/Ubuntu/Mint" -- now "Debian/Mint/Ubuntu").
    Second, yes, /. does allow that, and I hope they continue to do so, because deb:// and click to install is neat and handy (even a lot of old Linux hands don't even know about it).
    Finally, (as you mentioned) it's not a link to download software, but rather install software from the repositories, so there's that level of security.
- Re: (Score:2)
  
  by garry_g ( 106621 ) writes:
  
  Or ASCII GUI: vimdiff ... works fine, also with 3 files ...
- Re: (Score:2)
  
  by pak9rabid ( 1011935 ) writes:
  
  I like kompare [caffeinated.me.uk].
awk? (Score:2)

by realyendor ( 32515 ) writes:

Done! It's called "awk". Just set the RS and FS fields as appropriate. :P
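
As a rough illustration of what setting RS and FS buys you, the sketch below mimics awk-style record and field splitting in Python. The sample data and the records helper are made up for illustration; a blank line plays the role of RS and a newline the role of FS.

```python
def records(text, rs="\n\n", fs="\n"):
    """Split text into records on rs, then each record into fields on fs,
    roughly like awk with RS and FS set to fixed strings."""
    for record in text.strip().split(rs):
        yield record.split(fs)

# Hypothetical multi-line records separated by blank lines.
sample = "name: alpha\nport: 80\n\nname: beta\nport: 443\n"

# "grep" over whole records rather than single lines:
# keep every record in which some field is exactly "port: 443".
matches = [fields for fields in records(sample) if "port: 443" in fields]
print(matches)
```

This covers flat records with a fixed separator. It has no notion of nesting, so a block inside a block is out of reach without a real parser.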
- Perl (Score:3)
  
  by wdef ( 1050680 ) writes:
  
  Perl can context grep any ****ing thing any which way from Sunday. Much easier and more powerful than awk.
Follow the money...? (Score:2, Interesting)

by dzfoo ( 772245 ) writes:

funded in part by Google and the U.S. Energy Department
I wonder what's the interest of these two in this.
-dZ.
- RTFA? (Score:5, Informative)
  
  by DragonWriter ( 970822 ) writes: on Thursday December 08, 2011 @03:19PM (#38306394)
  
  funded in part by Google and the U.S. Energy Department
  I wonder what's the interest of these two in this.
  FTFA:
  Google's interest in this technology springs from the company's efforts in cloud computing, where it must automate operations across a wide range of networking gear, Weaver said. The DOE foresees that this sort of software could play a vital role in smart grids, in which millions of energy consuming end-devices would have connectivity of some sort. The software would help "make sense of all the log files and the configurations of the power control networks," Weaver said.
  
- Re: (Score:2)
  
  by neokushan ( 932374 ) writes:
  
  I wonder why people feel the need to "sign" their posts, when their username is quite clearly visible at the top.
  -nk
- - Re:Follow the money...? (Score:5, Insightful)
    
    by Tanktalus ( 794810 ) writes: on Thursday December 08, 2011 @03:13PM (#38306334) Journal
    
    Context-free grep/diff can be used to search for data/changes in arbitrary non-line-record-based files. Such as XML, HTML, JSON, SQLite databases, other databases, Apache configs, and many other pieces of data. Heck, even most programming languages are not line-based, but statement terminated/separated. Imagine being able to grep for a function name, and getting its entire prototype/usage even when it spans multiple lines (very common in standard glibc headers). And, depending on the plugin's capabilities, you could grep for a function name as a function name and not get back any usage of that text as a variable or embedded in a string, or a comment (skip commented-out calls!).
    If there's sufficient configurability, you could ask for the entire block that given text is in, and such a grep would be able to display everything in the corresponding {...}. Makes grep that much more valuable.
    So, my question is, why aren't more IT-heavy corporations/government departments involved?
    
    - Re: (Score:2)
      
      by hedwards ( 940851 ) writes:
      
      Why does that necessitate screwing around with grep? I can sort of see modifying diff, but with grep if you need that data you'd write a new program to parse it and pipe it.
    - Re: (Score:3)
      
      by bobaferret ( 513897 ) writes:
      
      So weird. I spent the last 6 months writing some Java libraries that do exactly this. There were some similar things out there, but they weren't licensed appropriately for my uses, or were WAY too expensive. Writing a hierarchical diff engine is the most complex thing I've ever done, hell writing an efficient pure diff engine is insane itself. You have to identify blocks/structure. then you have to diff the structures, then you have to diff the content in the structures. Once all of that is said and done th
      - Re: (Score:3)
        
        by bobaferret ( 513897 ) writes:
        
        LOL and that my friend is the hard part. It cost me $4000 in legal fees to make sure they are not owned by the company I work for, and 6 weeks of work. I'm leaning towards an AGPL/open core model. I just see so many people NOT happy with open core stuff. Also, I didn't get a grant from Google or the D.O.E. And these are just small, yet integral, parts of a larger system. That I don't really want to give away yet. Hell, deciding on licensing is harder than coding sometimes. Gotta feed the family you know,
        
        Re:Follow the money...? (Score:4, Interesting)
        
        by bobaferret ( 513897 ) writes: on Thursday December 08, 2011 @05:41PM (#38308450)
        
        I wouldn't call it a cancer. But it's definitely useful if you don't ever want commercial companies to use your code in public. It matches up well with the open core model. Commercial people will only use it if you can give them a differently licensed copy of the code. Apache, MIT, and BSD are great if you truly want to give your code away and don't care what people do with it behind closed doors. AGPL is nice to make sure people always give back. LGPL and GPL are nice if you only want them to give back if they change it. Should people pay and how much is an age-old question. I have to balance the cost of support and development vs. the cost of the product. The more I lean on the community the less I can charge and the more exposure I get. While in the other direction I get more money, but have to spend more of it. And there is no one size fits all solution to any of this.
        
        Re: (Score:2)
        
        by bobaferret ( 513897 ) writes:
        
        A new years resolution it is then...
  - Re: (Score:3)
    
    by Doc Ruby ( 173196 ) writes:
    
    Vast amounts of OS SW has been funded by the government. BSD was developed by UC Berkeley, which is largely funded by Pentagon contracts.
    And the Internet.
    Meanwhile, the vast majority of open source projects never get past the opening statement.
    You clearly don't know what it takes to accomplish a project like this one. What have you ever done, that gives you some standing to announce that this Usenix project is a load of crap?
    - Re: (Score:2)
      
      by RocketRabbit ( 830691 ) writes:
      
      I'm no doctor but I can tell that wound is infected.
Interesting... (Score:3, Interesting)

by DangerOnTheRanger ( 2373156 ) writes: on Thursday December 08, 2011 @02:51PM (#38306052) Homepage Journal

With these tools, you could make grep and diff work with binary files in a meaningful way - very useful at times. I bet you could even adapt the "Context-Free Grep" into a sort of packet sniffer with enough work. I'd sure like to try these new programs sometime.

No download link? (Score:2)

by roguegramma ( 982660 ) writes:

I would have wished for a download link ..
Link to one of their papers on these tools (Score:5, Informative)

by treerex ( 743007 ) writes: on Thursday December 08, 2011 @03:07PM (#38306266) Homepage

http://www.cs.dartmouth.edu/reports/TR2011-705.pdf [dartmouth.edu]

Almost vaporware (Score:2)

by gmuslera ( 3436 ) * writes:

The grep is "in design process", the diff is "not released yet". And there should be a lot of alternative tools to those 2, some of which go after the same goal (e.g. mailgrep). I'm all for improving those 2 venerable tools, but the announcement looks a bit out of time or scale.
sgrep (Score:2)

by SgtChaireBourne ( 457691 ) writes:

There used to be a utility, sgrep [helsinki.fi], for searching SGML/XML.
Object grep (Score:2)

by Doc Ruby ( 173196 ) writes:

I'd like a grep tool that could scan XML data for instances of objects (according to some XSD or DTD), and take object state values as arguments to search objects for.
If it could scan objects in memory I'd love that better, but XML seems the only likely candidate for a format that a universal tool would parse.
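
A minimal sketch of that kind of object grep, using Python's standard xml.etree.ElementTree. The document, the element names and the grep_objects helper are all invented for illustration, and no XSD or DTD validation happens here, since ElementTree does not validate.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: element and attribute names are invented.
DOC = """<inventory>
  <host name="web1"><state>up</state><cpus>4</cpus></host>
  <host name="web2"><state>down</state><cpus>8</cpus></host>
  <host name="db1"><state>up</state><cpus>16</cpus></host>
</inventory>"""

def grep_objects(xml_text, tag, **wanted):
    """Return elements named `tag` whose child elements carry the wanted text."""
    root = ET.fromstring(xml_text)
    hits = []
    for elem in root.iter(tag):
        # findtext returns None when the child is missing, so such
        # elements simply fail the comparison.
        if all(elem.findtext(child) == value for child, value in wanted.items()):
            hits.append(elem)
    return hits

up_hosts = [h.get("name") for h in grep_objects(DOC, "host", state="up")]
print(up_hosts)
```

Here the "object state values" are passed as keyword arguments, for example grep_objects(DOC, "host", state="up", cpus="16"). Scanning objects in memory, as wished for above, is outside what this sketch attempts.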
- Re: (Score:3)
  
  by jd ( 1658 ) writes:
  
  XML is ok, but there are many data formats that could really use a diff/grep utility that could make sense of them. HDF5 and NetCDF are nice in the scientific community, for example. Computer graphics geeks might find intelligent diff/grep tools for the Renderman format to be useful. Office users might want to know if two documents are genuinely different or were compressed differently. Hell, it would be incredibly useful if they could diff a MS Office file and LibreOffice file in their native formats to se
Ooooh! (Score:4, Interesting)

by gstoddart ( 321705 ) writes: on Thursday December 08, 2011 @03:25PM (#38306464) Homepage

As soon as I see "Context-Free Grep", I immediately think of a Context Free Grammar [wikipedia.org].
That basically implies we can have much more sophisticated rules that match other structural elements the way a language compiler does. Which means that in theory you could do greps that take into account structures a little more complex than just a flat file.
Grep and diff that can be made aware of the larger structure of documents potentially have a lot of uses. Anybody who has had to maintain structured config files across multiple environments has likely wished for this before.
Sounds really cool.
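
As a rough sketch of what structure-aware matching could look like, the function below returns the innermost brace-delimited block around a match instead of a single line. It is not the researchers' tool or algorithm; the enclosing_block name and the sample config are hypothetical, and braces inside strings or comments are deliberately ignored.

```python
def enclosing_block(text, needle, open_ch="{", close_ch="}"):
    """Return the innermost {...} block containing the first occurrence of
    needle, or None if there is no match or no enclosing block."""
    pos = text.find(needle)
    if pos == -1:
        return None
    stack = []  # positions of braces that are currently open
    for i, ch in enumerate(text):
        if ch == open_ch:
            stack.append(i)
        elif ch == close_ch and stack:
            start = stack.pop()
            # Inner blocks close before outer ones, so the first closed
            # block that spans the match is the innermost one.
            if start < pos and pos + len(needle) <= i:
                return text[start:i + 1]
    return None

# Hypothetical nested config file.
CONFIG = """server {
    listen 80;
    location /api {
        proxy_pass http://backend;
    }
    location /static {
        root /var/www;
    }
}"""

block = enclosing_block(CONFIG, "proxy_pass")
print(block)
```

A line-based grep for proxy_pass would print one line with no hint of which location it belongs to. The depth tracking on a stack is the part a regular expression in the strict sense cannot do.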

- Re: (Score:2)
  
  by skids ( 119237 ) writes:
  
  It will be interesting to see what they come up with. From the paper posted above it looks like it will definitely be taking "wisdom" about certain file types, but I hope they also work on some fuzzy guessing modes as well that do not require prior knowledge of the language being parsed.
  The main potential for ick factor is whether they can manage to get a set of commandline flags that can be used/learned incrementally so you don't have to memorize a ream of flags just to get something useful done, and can
- Re: (Score:2)
  
  by steelfood ( 895457 ) writes:
  
  One of the reasons context-free searches aren't more prevalent in everyday computing is the increase in complexity and thus computation resources needed to process them. If regular grammar was linear, then context-free is closer to linearithmic (n*log(n)).
  Regular expressions can handle multi-line searches as easily as they do single-line searches (some people above were saying how the multi-line aspect of this would be real useful). The line delimiter is merely a convenience.
  What it can't handle are nested sea
Microsoft Ad (Score:4, Interesting)

by lucm ( 889690 ) writes: on Thursday December 08, 2011 @03:38PM (#38306628)

I know I'll be modded down, but I have to say it: what they describe is already available in Powershell, where objects can be piped in search/filter functions.

- Re: (Score:2, Funny)
  
  by Mars Saxman ( 1745 ) writes:
  
  That's great for all fifty people who use Powershell.
  - Re: (Score:2)
    
    by lucm ( 889690 ) writes:
    
    That's great for all fifty people who use Powershell.
    You might need to get a better "grep" on reality, it's not 2004 anymore.
- Oh, to suffer the slings and arrows... (Score:2)
  
  by Tetsujin ( 103070 ) writes:
  
  I know I'll be modded down
  Dude, the only part of your post that I find objectionable is this assumption that you're going to be crucified for posting your thoughts. I know that there are some people on Slashdot who are pretty predictably triggered to shout down certain opinions - just don't assume that everyone here is like that, OK?
  I think there's a lot to like about Powershell, and part of me will always be a bit jealous that Windows got a shell with those kinds of capabilities before Linux did. It does indeed seem that what the
  - Re: (Score:2)
    
    by lucm ( 889690 ) writes:
    
    I think there's a lot to like about Powershell, and part of me will always be a bit jealous that Windows got a shell with those kinds of capabilities before Linux did. It does indeed seem that what they describe bgrep and bdiff doing could be accomplished in Powershell. I've never been too clear on some of the particulars of how that would be done, though. As I understand it, you can search/filter either XML data streams, or a sequence of .NET objects. Would the way to accomplish this in .NET, then, be to have a commandlet that opens the source file and passes them through as .NET objects? It would be a bit less compact than having the special type handling right in the "find" or "filter" command but it does lend a certain clarity to things, too...
    Powershell and .Net can collaborate both ways. As an example, many recent Microsoft products are using .Net for the GUI but in the backend all the actual work is done in Powershell. The opposite is true - one can plug a .Net component in a Powershell script, as an example to do a custom filter.
    The best example for the PS pipe model is with the VMWare extensions (PowerCLI). You can get a full inventory by writing a script like this:
    Get-VMHost | Get-VM | Export-Csv c:\myinventory.txt
    In the CSV you get all the p
- Re: (Score:2)
  
  by grcumb ( 781340 ) writes:
  
  I know I'll be modded down, but I have to say it: what they describe is already available in Powershell, where objects can be piped in search/filter functions.
  Sure, and it's been possible in Perl (for example) forever:
  use File::Slurp;
  my $multi_line_pattern = join("", @ARGV);
  my $text = read_file( 'filename' );
  if ($text =~ /$multi_line_pattern/) {
      # do something useful.
  }
  The only problem with the above is that it fails in anything other than trivial situations.
  The issue isn't passing things through filters, it's doing so in a way that you don't have to write insanely devious and complex filters. This grep tool is still only at the design stage, so I'll
- - Powershell envy (Score:2)
    
    by Tetsujin ( 103070 ) writes:
    
    So what? Maybe people want a non-proprietary solution that works on more than one OS.
    If there are such people, and it's not just me, I'd love to oblige them. :) I really need to get crackin'...
darcs (Score:2)

by bcrowell ( 177657 ) writes:

There's a version-control system called darcs (written by the son of a colleague of mine) that incorporates some interesting ideas along these lines. For example, say you have a program with 100,000 lines of code, and there's a function in it called Foo, which is called thousands and thousands of times. You want to change the name of the function to Bar. In a traditional diff-based system, this results in thousands of differences. Darcs is supposed to be able to handle changes like this and recognize that i
existing tools and suggestion (Score:2)

by khipu ( 2511498 ) writes:

PCRE has recursive patterns (available as pcregrep) and .NET has balancing groups, also allowing grep-like operations involving context-free grammars. For XML data, there are various XML query languages that allow wonderfully complex queries over XML structures. There are also refactoring tools that allow syntax-aware searches across source files.
For diff, the situation is a bit more complicated. There are XML-based diff tools, programming language syntax aware diff tools, and complex edit distance based
- Re: (Score:2)
  
  by Tetsujin ( 103070 ) writes:
  
  There are XML-based diff tools, programming language syntax aware diff tools, and complex edit distance based diff tools already. It seems difficult to come up with something more generic. Let's say you want to diff programming language source files in languages for which there are no diff tools. What good is a context free diff tool going to be? You'd need to specify the entire grammar for the language.
  I don't know if you didn't read this bit, or if I've misunderstood your post... But the basic (proposed?) approach of bgrep and bdiff is to provide a plug-in mechanism that would be used to extend the tools to new languages and data types. So, yes, you would have to specify the entire grammar for the language, but you'd only do it once... Or preferably, someone would have already done it for you. :)
  - Re: (Score:2)
    
    by khipu ( 2511498 ) writes:
    
    The use cases, options, and interfaces are different for searching programming language source files, XML files, and other text. So, you really need at least three tools: bgrep-lang, bgrep-xml, and bgrep-text. Each of those might then have a plugin mechanism. But these three classes of tools already exist. Trying to force them into a single command line tool makes little sense to me.
    "bgrep-text" is just pcregrep.
    "bgrep-xml" is any one of a number of XML query and search tools, using XQuery or similar la
Terrible idea (Score:5, Insightful)

by deblau ( 68023 ) writes: <slashdot.25.flickboy@spamgourmet.com> on Thursday December 08, 2011 @04:03PM (#38306990) Journal

This violates so many rules of the Unix philosophy [wikipedia.org] that I don't even know where to begin...
FTFA:
Grep has issues with data blocks as well. "With regular expressions, you don't really have the ability to extract things that are nested arbitrarily deep," Weaver said.

If your data structures are so complex that diff/grep won't cut it, they should probably be massaged into XML, in which case you can use XSLT [wikipedia.org] off the shelf. It's already customizable to whatever data format you're working with.
FTFA:
With [operational data in block-like data structures], a tool such as diff "can be too low-level," Weaver said. "Diff doesn't really pay attention to the structure of the language you are trying to tell differences between." He has seen cases where diff reports that 10 changes have been made to a file, when in fact only two changes have been made, and the remaining data has simply been shifted around.
No, 10 changes have been made. The fact that only two substantive changes have been made based on 10 edits is a subjective determination. That is, unless you want to detect that moving a block of code or data from one place to another in a file has no actual effect, in which case good luck because that's a domain-specific hard problem.
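
To make the disagreement concrete, here is a minimal Python sketch (an illustration only, not the Dartmouth tools). A line-based diff reports several edits when a block is merely moved, while comparing the same text as an unordered set of blocks reports none. The blank-line block format and the helper names `split_blocks`, `line_changes` and `block_changes` are assumptions made up for this example.

```python
import difflib

def split_blocks(text):
    """Split text into blocks separated by blank lines (an assumed block format)."""
    return [b.strip() for b in text.strip().split("\n\n") if b.strip()]

def line_changes(old, new):
    """Count the lines a line-based diff marks as added or removed."""
    diff = difflib.ndiff(old.splitlines(), new.splitlines())
    return sum(1 for d in diff if d.startswith(("+ ", "- ")))

def block_changes(old, new):
    """Return (removed, added) blocks, deliberately ignoring block order."""
    old_blocks, new_blocks = set(split_blocks(old)), set(split_blocks(new))
    return sorted(old_blocks - new_blocks), sorted(new_blocks - old_blocks)

# The same two blocks, with their order swapped.
OLD = "host a\n  port 80\n\nhost b\n  port 443\n"
NEW = "host b\n  port 443\n\nhost a\n  port 80\n"

if __name__ == "__main__":
    print("line-level changes:", line_changes(OLD, NEW))
    print("block-level changes:", block_changes(OLD, NEW))
```

Ignoring order is only correct when block order carries no meaning. Deciding that is exactly the domain-specific judgement the comment above points to.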

- Re: (Score:2)
  
  by bussdriver ( 620565 ) writes:
  
  no big deal; these features can become a FLAG option on grep and diff and I wouldn't care-- it could be useful; I just use perl for most stuff and grep for only really simple stuff. Ok, I'd probably still use perl... it's my swiss army pocket knife I use to hammer everything ;-)
- 10 changes have been made - Disagree (Score:2)
  
  by roguegramma ( 982660 ) writes:
  
  Suppose you signal the nesting level by indentation, as most programmers today do.
  If you add a condition around some code, then for example 3 lines might be indented, resulting in 5 lines being altered instead of the 2 which actually have changed.
  For this, the proposed improved grep and diff might be good, at least better than the current state of diff. Okay, maybe I'm not telling about the -b flag, but the -b flag might be a problem if you code in whitespace or so ;-)
  http://en.wikipedia.org/wiki/Whitespace_(p [wikipedia.org]
- Re: (Score:3)
  
  by Tetsujin ( 103070 ) writes:
  
  This violates so many rules of the Unix philosophy [wikipedia.org] that I don't even know where to begin...
  I'll take this on. It's a subject that is of particular interest to me.
  First of all, you have to consider whether it even matters that a tool violates "rules" of the "Unix philosophy". I mean, seriously, why assume that some system design ideas cooked up 30-40 years ago are necessarily the One True Path? Because "those who do not understand Unix are doomed to reinvent it poorly"? What if the designers in question do understand Unix? Or what if <gasp> they might actually have some ideas that surpa
Structural Regular Expressions (Score:2)

by vAltyR ( 1783466 ) writes:

This reminds me of a paper Rob Pike wrote a while back addressing this problem. His solution was a generalization of regular expressions, which he termed Structural Regular Expressions. [cat-v.org] I'm not sure how these stack up against context-free grammars, but it's an interesting approach that seems at least fairly similar to the Dartmouth work. In any case, I didn't see it as a reference, so I thought I'd mention it.
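
A rough Python sketch of the two-level idea, assuming the gist of the paper is "let one regular expression carve the input into pieces, then apply a further test to each piece" rather than treating every piece as a line. The function name `x_then_g` and the sample config are invented for illustration. This is not syntax from Pike's tools or from the Dartmouth work.

```python
import re

def x_then_g(text, extract, guard):
    """Extract every match of `extract`, then keep the pieces that contain `guard`."""
    pieces = re.findall(extract, text, flags=re.DOTALL)
    return [p for p in pieces if re.search(guard, p)]

CONFIG = """\
interface eth0 {
    address 10.0.0.1
    mtu 1500
}
interface eth1 {
    address 10.0.0.2
    mtu 9000
}
"""

if __name__ == "__main__":
    # The pieces are whole "interface ... }" blocks, not single lines.
    for block in x_then_g(CONFIG, r"interface \w+ \{.*?\}", r"mtu 9000"):
        print(block)
```

The non-greedy `.*?\}` stops at the first closing brace, so this only copes with flat blocks. Arbitrarily nested braces need a real parser, which is presumably where a context-free approach earns its keep.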
TXR! (Score:2)

by Kaz Kylheku ( 1484 ) writes:

I'm also working on a text processing tool that deals with blocks of data, and it is already here.
http://www.nongnu.org/txr [nongnu.org]
- Re: (Score:2)
  
  by GameboyRMH ( 1153867 ) writes:
  
  Well if we can use our computers more efficiently then we'll save energy. On the other hand I can't imagine what use the DOD would have for this, especially since they seem to run Windows at every opportunity...
  - Re: (Score:2)
    
    by amiga3D ( 567632 ) writes:
    
    I think maybe some of the scientist types at the DOE were behind the funding.
- Re: (Score:3, Interesting)
  
  by iced_tea ( 588173 ) writes:
  
  They have HUGE amounts of data kicking around from various simulations/experiments.
  
  Check out the Wikipedia article for supercomputers [wikipedia.org], and you'll see DOE mentioned.
  
  Tools like this could help with analysis and finding certain data sets. IIRC, regex are already used in DNA sequencing. There is probably a similar application and use for tools like this with their data.
- Re: (Score:3)
  
  by interval1066 ( 668936 ) writes:
  
  Do we really need to improve on something that works already? A grep that handles binary formats might be nice, but I think I'd rather see this spun off into some kind of new tool or two, like an "extended" grep and diff, maybe. Maybe they're doing that.
  - Re: (Score:3)
    
    by gstoddart ( 321705 ) writes:
    
    Do we really need to improve on something that works already?
    
    This would work, but better. No, I'm not being flippant.
    If you have structured data (say XML), you could target hierarchies like config-root:server-name:name. That way if the text inside "name" is only being looked for in that one field, you won't hit a bunch of other stuff that also happens to be similar strings but is unrelated.
    I'm sure you'd still have your regular grep/diff utilities, but there's definitely places where being able to match
- - Re: (Score:2)
    
    by Anne Thwacks ( 531696 ) writes:
    
    They're not going to break into your house and apt-get remove grep
    Are you sure?
    They can probably do it remotely on most OSes anyway. Quick - make friends with Theo.
- Perl to wa chigau no da yo! Perl to wa! (Score:2)
  
  by Tetsujin ( 103070 ) writes:
  
  Why do we need to write another perl?
  Is it really "writing another perl"? The meat of these tools (which, I think, aren't yet implemented?) is that they filter and compare parsed data structures - and provide plug-in hooks so people can insert parsers for additional data types. Certainly this could be done as a Perl library - and doing so might have some advantages over creating new tools with their own plug-in mechanism. But implementing bgrep and bdiff is nowhere close to "writing another perl".
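
The "filter parsed data structures, with plug-in parsers" idea from the comment above can be sketched in a few lines of Python. Everything here is hypothetical: the `PARSERS` registry, the `register` decorator and `struct_grep` are invented names, and JSON stands in for whatever formats the real tools would support. Treat it as a sketch of the general shape, not of bgrep's actual interface.

```python
import json

PARSERS = {}  # hypothetical registry: format name -> parser function

def register(fmt):
    """Decorator that registers a parser plug-in under a format name."""
    def wrap(func):
        PARSERS[fmt] = func
        return func
    return wrap

@register("json")
def parse_json(text):
    """Plug-in for JSON; other formats would register their own parser."""
    return json.loads(text)

def walk(node, path=()):
    """Yield (path, value) for every leaf in nested dicts and lists."""
    if isinstance(node, dict):
        for key, child in node.items():
            yield from walk(child, path + (key,))
    elif isinstance(node, list):
        for index, child in enumerate(node):
            yield from walk(child, path + (index,))
    else:
        yield path, node

def struct_grep(text, fmt, predicate):
    """Parse `text` with the plug-in for `fmt`; return leaves the predicate accepts."""
    tree = PARSERS[fmt](text)
    return [(path, value) for path, value in walk(tree) if predicate(path, value)]

DOC = '{"server": {"name": "alpha", "peers": [{"name": "beta"}]}, "name": "gamma"}'

if __name__ == "__main__":
    # Match "name" only when it sits directly under "server".
    print(struct_grep(DOC, "json", lambda p, v: p == ("server", "name")))
```

Because the predicate sees the full path to each value, a search for "name" under "server" does not hit the other "name" fields, which a line-oriented grep cannot tell apart.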
