Researchers Expanding Diff, Grep Unix Tools

itwbennett writes "At the Usenix Large Installation System Administration (LISA) conference being held this week in Boston, two Dartmouth computer scientists presented variants of the grep and diff Unix command line utilities that can handle more complex types of data. The new programs, called Context-Free Grep and Hierarchical Diff, will provide the ability to parse blocks of data rather than single lines. The research has been funded in part by Google and the U.S. Energy Department."
  • by gnasher719 ( 869701 ) on Thursday December 08, 2011 @02:46PM (#38305988)
    Putting space characters in the name of a Unix command-line tool is asking for trouble.
    • Re:Strange names (Score:4, Insightful)

      by realyendor ( 32515 ) on Thursday December 08, 2011 @02:51PM (#38306056)

      I expect those are just the spoken names and that the commands will still be single words, similar to:
      "GNU awk" -> gawk
      "enhanced grep" -> egrep

    • If the FS supports spaces in filenames, then you have broken code if you can't tolerate it. MS wisely put a space in the "Program Files" name when they added long filenames to Windows. That'll put any delusions about being able to ignore it to a direct immediate stop.

      • Re:Strange names (Score:4, Interesting)

        by adonoman ( 624929 ) on Thursday December 08, 2011 @03:12PM (#38306322)

        But having to use quotes every time you call a command is a sure way to make sure your command is never used.

        Would you rather type this:
        ./"Context-Free Grep" ...
        or this:
        ./cfgrep ..

        • Re:Strange names (Score:5, Insightful)

          by iluvcapra ( 782887 ) on Thursday December 08, 2011 @03:48PM (#38306752)

          If you don't like a tool's name, export an alias.

          It's not about typing commands as much as it's about making these work:

          $ find . -name "*.txt" | xargs wc
          $ for file in $*; do
          mv $file old/$file
          done

          Versus these:

          $ find . -name "*.txt" -print0 | xargs -0 wc
          $ for file in "$@"; do
          mv "$file" "old/$file"
          done

          A lot of scripts you run into are just broken because of braindead assumptions.
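
The difference between the two versions above can be sketched in Python. This is only an illustration, not what xargs does internally: the file names are made up, and `str.split()` stands in for the whitespace splitting that an unquoted `$file` or a plain `xargs` performs.

```python
# Hypothetical file names; the second one contains a space.
names = ["notes.txt", "old report.txt"]

# Newline-delimited stream, split on whitespace the way an unquoted
# $file or a plain xargs would split it.
naive = "\n".join(names).split()

# NUL-delimited stream, as produced by `find -print0` and read by `xargs -0`.
stream = "\0".join(names) + "\0"
safe = [name for name in stream.split("\0") if name]

print(naive)  # ['notes.txt', 'old', 'report.txt']
print(safe)   # ['notes.txt', 'old report.txt']
```

NUL is one of the two bytes that cannot appear in a Unix file name (the other is the slash), which is why it works as a delimiter where spaces and newlines do not.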

        • by gangien ( 151940 )

          in scripts, i pretty much quote everything. seems to be the way to avoid problems. of course, i'm not a sysadmin by trade, so maybe it's bad for some reason or something.

          when at the prompt i hit tab.

          We'd probably avoid a lot of problems if people weren't too lazy to type a few extra characters.

        • by emj ( 15659 )

          ./C makes that ok, but that's not the problem. The problem is that you lose one level of quotation.

      • by jandrese ( 485 )
        Ironically, many of Microsoft's tools have trouble dealing with the space in the filename, including the blasted Run window.

        Just because there is a way to make it work doesn't mean there isn't a problem with it. All Unix shells can handle spaces in filenames, but the methods for doing so are not always intuitive, and it's easy to mess up things like shell scripts. Even the "proper" solutions have problems.

        And I can't stand "Program Files", what a mess that has been.
        • That's why I always create these 2 directories on Windows installations: "C:\Software" and "C:\Hardware". I change "Program Files" to "Software" in every installer that gives the user that option. Except driver software goes in Hardware. Quick way to sort out what I've installed from what something else installed. And it fits in 8 characters, in case that old limit is ever an issue.
    • Re:Strange names (Score:5, Informative)

      by mytec ( 686565 ) on Thursday December 08, 2011 @03:31PM (#38306540) Journal
      According to this paper [], they are called bgrep and bdiff.
  • A nice GUI diff for Linux. (Has 3-way).

    Click here to install [deb]

    • Re: (Score:3, Insightful)

      by Anonymous Coward

      It is surprising that Slashdot even let you post a deb: url, as the filter usually seems to destroy most non-http(s) links. However, not everyone uses a Debian-based distro, and not everyone tries some random package (even from the repository) before reading a little about it, so posting the home page [] would have been a bit more useful.

      • Yeah, I usually post a disclaimer ("for Debian/Ubuntu/Mint" -- now "Debian/Mint/Ubuntu").

        Second, yes, /. does allow that, and I hope they continue to do so, because deb:// and click to install is neat and handy (even a lot of old Linux hands don't even know about it).

        Finally, (as you mentioned) it's not a link to download software, but rather install software from the repositories, so there's that level of security.

    • by garry_g ( 106621 )

      Or ASCII GUI: vimdiff ... works fine, also with 3 files ...

    • I like kompare [].
  • Done! It's called "awk". Just set the RS and FS fields as appropriate. :P
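
A rough Python sketch of the record-separator idea, assuming blank-line-separated records (similar in spirit to awk's paragraph mode). The function name `grep_records` and the sample data are hypothetical.

```python
import re

def grep_records(text, pattern, record_sep=r"\n\s*\n"):
    """Return whole records (not single lines) that contain a match for pattern."""
    records = [r for r in re.split(record_sep, text.strip()) if r]
    return [r for r in records if re.search(pattern, r)]

# Hypothetical config-style input: records separated by blank lines.
sample = """\
host alpha
  addr 10.0.0.1
  role web

host beta
  addr 10.0.0.2
  role db
"""

for record in grep_records(sample, r"role db"):
    print(record)
```

This returns the entire "host beta" record rather than just the matching line. It only works when records sit side by side with a fixed separator; it has no notion of nesting.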

    • by wdef ( 1050680 )
      Perl can context grep any ****ing thing any which way from Sunday. Much easier and more powerful than awk.
  • Follow the money...? (Score:2, Interesting)

    by dzfoo ( 772245 )

    funded in part by Google and the U.S. Energy Department

    I wonder what's the interest of these two in this.


    • RTFA? (Score:5, Informative)

      by DragonWriter ( 970822 ) on Thursday December 08, 2011 @03:19PM (#38306394)

      funded in part by Google and the U.S. Energy Department

      I wonder what's the interest of these two in this.


      Google's interest in this technology springs from the company's efforts in cloud computing, where it must automate operations across a wide range of networking gear, Weaver said. The DOE foresees that this sort of software could play a vital role in smart grids, in which millions of energy consuming end-devices would have connectivity of some sort. The software would help "make sense of all the log files and the configurations of the power control networks," Weaver said.

    • I wonder why people feel the need to "sign" their posts, when their username is quite clearly visible at the top.


  • Interesting... (Score:3, Interesting)

    by DangerOnTheRanger ( 2373156 ) on Thursday December 08, 2011 @02:51PM (#38306052) Homepage Journal
    With these tools, you could make grep and diff work with binary files in a meaningful way - very useful at times. I bet you could even adapt the "Context-Free Grep" into a sort of packet sniffer with enough work. I'd sure like to try these new programs sometime.
  • I would have wished for a download link ..
  • The grep is "in design process", the diff is "not released yet". And there should already be plenty of alternatives to those two tools, some of which pursue the same goal (e.g. mailgrep). I'm all for improving those two venerable tools, but the announcement looks a bit premature or out of proportion.
  • There used to be a utility, sgrep [], for searching SGML/XML.
  • I'd like a grep tool that could scan XML data for instances of objects (according to some XSD or DTD), and take object state values as arguments to search objects for.

    If it could scan objects in memory I'd love that better, but XML seems the only likely candidate for a format that a universal tool would parse.
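
For the XML case, something close to this can be sketched with Python's standard library. This is a minimal illustration under stated assumptions: the document, the `server` tag, and the `state` attribute are made up, the helper `find_objects` is hypothetical, and nothing here validates against an XSD or DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical inventory document; one server is nested inside a rack.
doc = """<inventory>
  <server name="alpha" state="up"><port>80</port></server>
  <server name="beta" state="down"><port>5432</port></server>
  <rack><server name="gamma" state="down"><port>22</port></server></rack>
</inventory>"""

def find_objects(xml_text, tag, **state):
    """Find elements named `tag`, at any depth, whose attributes equal the given values."""
    root = ET.fromstring(xml_text)
    predicates = "".join(f"[@{key}='{value}']" for key, value in state.items())
    return root.findall(f".//{tag}{predicates}")

matches = [el.get("name") for el in find_objects(doc, "server", state="down")]
print(matches)  # ['beta', 'gamma']
```

The search follows the element tree rather than the line layout, so the nested `gamma` server is found just like the top-level ones.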

    • by jd ( 1658 )

      XML is ok, but there are many data formats that could really use a diff/grep utility that could make sense of them. HDF5 and NetCDF are nice in the scientific community, for example. Computer graphics geeks might find intelligent diff/grep tools for the Renderman format to be useful. Office users might want to know if two documents are genuinely different or were compressed differently. Hell, it would be incredibly useful if they could diff a MS Office file and LibreOffice file in their native formats to se

  • Ooooh! (Score:4, Interesting)

    by gstoddart ( 321705 ) on Thursday December 08, 2011 @03:25PM (#38306464) Homepage

    As soon as I see "Context-Free Grep", I immediately think of a Context Free Grammar [].

    That basically implies we can have much more sophisticated rules that match other structural elements the way a language compiler does. Which means that in theory you could do greps that take into account structures a little more complex than just a flat file.

    Grep and diff that can be made aware of the larger structure of documents potentially have a lot of uses. Anybody who has had to maintain structured config files across multiple environments has likely wished for this before.

    Sounds really cool.
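
To make the idea concrete, here is a toy Python sketch of a structure-aware grep for brace-delimited blocks. It is not the researchers' tool and does not use a real grammar. The function `grep_blocks` and the config syntax are assumptions, and braces inside strings or comments are not handled.

```python
import re

def grep_blocks(text, header_pattern):
    """Return each brace-delimited block whose header matches header_pattern.

    Nesting is tracked with a depth counter, which is the part a classic
    regular expression cannot express for arbitrary depth.
    """
    results = []
    for match in re.finditer(header_pattern + r"\s*\{", text):
        depth = 0
        # Start scanning at the opening brace that ended the header match.
        for i in range(match.end() - 1, len(text)):
            if text[i] == "{":
                depth += 1
            elif text[i] == "}":
                depth -= 1
                if depth == 0:
                    results.append(text[match.start():i + 1])
                    break
    return results

# Hypothetical router-style configuration with a nested block.
config = "interface eth0 { mtu 1500; vlan 10 { tag on; } } interface eth1 { mtu 9000; }"
print(grep_blocks(config, r"interface eth0"))
# ['interface eth0 { mtu 1500; vlan 10 { tag on; } }']
```

The result is the whole `eth0` block, nested `vlan` block included, rather than the single line that line-oriented grep would print.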

    • by skids ( 119237 )

      It will be interesting to see what they come up with. From the paper posted above it looks like it will definitely be taking "wisdom" about certain file types, but I hope they also work on some fuzzy guessing modes as well that do not require prior knowledge of the language being parsed.

      The main potential for ick factor is whether they can manage to get a set of commandline flags that can be used/learned incrementally so you don't have to memorize a ream of flags just to get something useful done, and can

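    • One of the reasons context-free searches aren't more prevalent in everyday computing is the increase in complexity, and thus in the computational resources needed to process them. If matching a regular grammar is linear, then general context-free parsing is closer to cubic in the worst case (n^3 for algorithms such as CYK or Earley), although many practical grammars can be parsed in linear time.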

      Regular expressions can handle multi-line searches as easily as they handle single-line searches (some people above were saying how useful the multi-line aspect of this would be). The line delimiter is merely a convenience.

      What it can't handle are nested sea

  • Microsoft Ad (Score:4, Interesting)

    by lucm ( 889690 ) on Thursday December 08, 2011 @03:38PM (#38306628)

    I know I'll be modded down, but I have to say it: what they describe is already available in Powershell, where objects can be piped in search/filter functions.

    • Re: (Score:2, Funny)

      by Mars Saxman ( 1745 )

      That's great for all fifty people who use Powershell.

      • by lucm ( 889690 )

        That's great for all fifty people who use Powershell.

        You might need to get a better "grep" on reality, it's not 2004 anymore.

    • I know I'll be modded down

      Dude, the only part of your post that I find objectionable is this assumption that you're going to be crucified for posting your thoughts. I know that there are some people on Slashdot who are pretty predictably triggered to shout down certain opinions - just don't assume that everyone here is like that, OK?

      I think there's a lot to like about Powershell, and part of me will always be a bit jealous that Windows got a shell with those kinds of capabilities before Linux did. It does indeed seem that what the

      • by lucm ( 889690 )

        I think there's a lot to like about Powershell, and part of me will always be a bit jealous that Windows got a shell with those kinds of capabilities before Linux did. It does indeed seem that what they describe bgrep and bdiff doing could be accomplished in Powershell. I've never been too clear on some of the particulars of how that would be done, though. As I understand it, you can search/filter either XML data streams, or a sequence of .NET objects. Would the way to accomplish this in .NET, then, be to have a commandlet that opens the source file and passes them through as .NET objects? It would be a bit less compact than having the special type handling right in the "find" or "filter" command but it does lend a certain clarity to things, too...

        Powershell and .Net can collaborate both ways. As an example, many recent Microsoft products are using .Net for the GUI but in the backend all the actual work is done in Powershell. The opposite is true - one can plug a .Net component in a Powershell script, as an example to do a custom filter.

        The best example for the PS pipe model is with the VMware extensions (PowerCLI). You can get a full inventory by writing a script like this:
        Get-VMHost | Get-VM | Export-Csv c:\myinventory.txt

        In the CSV you get all the p

    • by grcumb ( 781340 )

      I know I'll be modded down, but I have to say it: what they describe is already available in Powershell, where objects can be piped in search/filter functions.

      Sure, and it's been possible in Perl (for example) forever:

      use File::Slurp;

      my $multi_line_pattern = join("", @ARGV);
      my $text = read_file( 'filename' ) ;
      if ($text =~ /$multi_line_pattern/){
      # do something useful.
      }

      The only problem with the above is that it fails in anything other than trivial situations.

      The issue isn't passing things through filters, it's doing so in a way that you don't have to write insanely devious and complex filters. This grep tool is still only at the design stage, so I'll

  • There's a version-control system called darcs (written by the son of a colleague of mine) that incorporates some interesting ideas along these lines. For example, say you have a program with 100,000 lines of code, and there's a function in it called Foo, which is called thousands and thousands of times. You want to change the name of the function to Bar. In a traditional diff-based system, this results in thousands of differences. Darcs is supposed to be able to handle changes like this and recognize that i

  • PCRE has recursive patterns (available as pcregrep) and .NET has balancing groups, also allowing grep-like operations involving context-free grammars. For XML data, there are various XML query languages that allow wonderfully complex queries over XML structures. There are also refactoring tools that allow syntax-aware searches across source files.

    For diff, the situation is a bit more complicated. There are XML-based diff tools, programming language syntax aware diff tools, and complex edit distance based

    • There are XML-based diff tools, programming language syntax aware diff tools, and complex edit distance based diff tools already. It seems difficult to come up with something more generic. Let's say you want to diff programming language source files in languages for which there are no diff tools. What good is a context free diff tool going to be? You'd need to specify the entire grammar for the language.

      I don't know if you didn't read this bit, or if I've misunderstood your post... But the basic (proposed?) approach of bgrep and bdiff is to provide a plug-in mechanism that would be used to extend the tools to new languages and data types. So, yes, you would have to specify the entire grammar for the language, but you'd only do it once... Or preferably, someone would have already done it for you. :)

      • by khipu ( 2511498 )

        The use cases, options, and interfaces are different for searching programming language source files, XML files, and other text. So, you really need at least three tools: bgrep-lang, bgrep-xml, and bgrep-text. Each of those might then have a plugin mechanism. But these three classes of tools already exist. Trying to force them into a single command line tool makes little sense to me.

        "bgrep-text" is just pcregrep.

        "bgrep-xml" is any one of a number of XML query and search tools, using XQuery or similar la

  • Terrible idea (Score:5, Insightful)

    by deblau ( 68023 ) <> on Thursday December 08, 2011 @04:03PM (#38306990) Journal

    This violates so many rules of the Unix philosophy [] that I don't even know where to begin...


    Grep has issues with data blocks as well. "With regular expressions, you don't really have the ability to extract things that are nested arbitrarily deep," Weaver said.

    If your data structures are so complex that diff/grep won't cut it, they should probably be massaged into XML, in which case you can use XSLT [] off the shelf. It's already customizable to whatever data format you're working with.


    With [operational data in block-like data structures], a tool such as diff "can be too low-level," Weaver said. "Diff doesn't really pay attention to the structure of the language you are trying to tell differences between." He has seen cases where diff reports that 10 changes have been made to a file, when in fact only two changes have been made, and the remaining data has simply been shifted around.

    No, 10 changes have been made. The fact that only two substantive changes have been made based on 10 edits is a subjective determination. That is, unless you want to detect that moving a block of code or data from one place to another in a file has no actual effect, in which case good luck because that's a domain-specific hard problem.
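
The disagreement can be sketched in Python. This assumes, loudly, that block order does not matter, which is exactly the domain-specific judgment in question. Blocks are blank-line-separated here, and the helper names and data are hypothetical.

```python
from collections import Counter

def blocks(text):
    """Split text into blank-line-separated blocks."""
    return [b.strip() for b in text.strip().split("\n\n") if b.strip()]

def block_diff(old, new):
    """Order-insensitive comparison: a block that only moved is not reported."""
    old_count, new_count = Counter(blocks(old)), Counter(blocks(new))
    removed = list((old_count - new_count).elements())
    added = list((new_count - old_count).elements())
    return removed, added

old = "a = 1\n\nb = 2\n\nc = 3\n"
new = "c = 3\n\na = 1\n\nb = 20\n"
print(block_diff(old, new))  # (['b = 2'], ['b = 20'])
```

A line-based diff would also report the move of `c = 3`; this sketch reports only the changed block. Whether ignoring the move is safe depends entirely on the format being compared.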

    • no big deal; these features can become a FLAG option on grep and diff and I wouldn't care-- it could be useful; I just use perl for most stuff and grep for only really simple stuff. Ok, I'd probably still use perl... it's my Swiss Army pocket knife that I use to hammer everything ;-)

    • Suppose you signal the nesting level by indentation, as most programmers today do.

      If you add a condition around some code, then for example 3 lines might be indented, resulting in 5 lines being altered instead of the 2 which actually have changed.

      For this, the proposed improved grep and diff might be good, or at least better than the current state of diff. Okay, maybe I'm overlooking the -b flag, but the -b flag might be a problem if you code in whitespace or so ;-) []
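
The "5 lines instead of 2" effect can be reproduced with Python's difflib. Stripping each line is used here only as a rough stand-in for a whitespace-insensitive comparison, not an exact model of any diff flag. The code snippet being compared is made up.

```python
import difflib

def changed_lines(old, new):
    """Count lines on the new side that are not part of an unchanged run."""
    matcher = difflib.SequenceMatcher(None, old, new, autojunk=False)
    return sum(j2 - j1 for tag, _, _, j1, j2 in matcher.get_opcodes() if tag != "equal")

# Hypothetical edit: three existing statements get wrapped in a condition.
old = ["x = compute();", "log(x);", "save(x);"]
new = ["if (enabled) {", "    x = compute();", "    log(x);", "    save(x);", "}"]

raw = changed_lines(old, new)
stripped = changed_lines([line.strip() for line in old], [line.strip() for line in new])
print(raw, stripped)  # 5 2
```

As noted above, ignoring whitespace is not an option when indentation itself carries meaning.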

    • This violates so many rules of the Unix philosophy [] that I don't even know where to begin...

      I'll take this on. It's a subject that is of particular interest to me.

      First of all, you have to consider whether it even matters that a tool violates "rules" of the "Unix philosophy". I mean, seriously, why assume that some system design ideas cooked up 30-40 years ago are necessarily the One True Path? Because "those who do not understand Unix are doomed to reinvent it poorly"? What if the designers in question do understand Unix? Or what if <gasp> they might actually have some ideas that surpa

  • This reminds me of a paper Rob Pike wrote a while back addressing this problem. His solution was a generalization of regular expressions, which he termed Structural Regular Expressions. [] I'm not sure how these stack up against context-free grammars, but it's an interesting approach that seems at least fairly similar to the Dartmouth work. In any case, I didn't see it as a reference, so I thought I'd mention it.
  • I'm also working on a text processing tool that deals with blocks of data; it is already available here. []
