Unix Programming

Unix Legend Adding Unicode Support To AWK - Once He Figures Out Git (arstechnica.com) 103

Co-creator of core Unix utility, now 80, just needs to run a few more tests. From a report: A Princeton professor, finding a little time for himself in the summer academic lull, emailed an old friend a couple months ago. Brian Kernighan said hello, asked how their US visit was going, and dropped off hundreds of lines of code that could add Unicode support for AWK, the text-parsing tool he helped create for Unix at Bell Labs in 1977. "I have tested this a fair amount but clearly more tests are needed," Kernighan wrote in the email, posted as a kind of pseudo-commit on the onetrueawk repo by longtime maintainer Arnold Robbins. "Once I figure out how ... I will try to submit a pull request. I wish I understood git better, but in spite of your help, I still don't have a proper understanding, so this may take a while." Kernighan is the "K" in AWK, a special-purpose language for extracting and manipulating language that was key to Unix's pipeline features and interoperability between systems. A working awk function (AWK is the language, awk the command to invoke it) is critical to both Standard UNIX Specification and IEEE POSIX certification for interoperability. There are countless variants of awk, but "One True AWK," sometimes known as nawk, is the version based on Kernighan's 1985 book The AWK Programming Language and his subsequent input.

Kernighan is also the "K" in "K&R C," the foundational 1978 book The C Programming Language he cowrote with Dennis Ritchie that sticks with programmers, mentally and in dog-eared paper form. C's roots go much deeper. Kernighan had been teaching C to workers at Bell Labs and convinced its creator, Dennis Ritchie, to collaborate on a book to spread the knowledge. That book gave birth to "the one true brace style," the endless debate that goes with it, and the structure underpinning every modern programming language. Kernighan also named Unix and first demonstrated the "Hello, world" code example.


  • by rsilvergun ( 571051 ) on Tuesday August 23, 2022 @02:04PM (#62814893)
    we'll finally have unicode support here.
    • we'll finally have unicode support here.

      Well, it seems like they did have it with the site re-design, that was so unpopular they pulled the whole thing and we never saw anything new again.

      I didn't think it was that bad personally, I just wish they had kept closer to the current UI design and simply added Unicode support to that.

      Maybe Slashdot unicode support still exists somewhere in a branch someone could dig up?

      Slashdot however strikes me as the ultimate example of "if it ain't broke don't fix it", so I w

      • It's precious that you think any of the current owners of /. actually know a single line of code for this site, or have made any meaningful changes other than shoving more ads down our throats in the last 5 years.

      • Testing this âoeUnicodeâ thing. How beautifully it translates the double quotes, and Iâ(TM)m especially impressed by the single quotes. These are probably the most useless characters anyway.

        • That is the modern mobile experience. I am sorry to say I do not write this on a mainframe console.

        • The efficiencies of Grep come from the reliable single-byte character stride of the text. With Unicode there are too many alternate frame shifts with variable-byte character representation. Not to mention all the escapes and modes.

          Grep on Unicode is pathetically slow.

          • Unicode is only variable width in the UTF-8 encoding. Just translate the input into UTF-32, which is constant width, do your processing, and then output UTF-8.
            • In a way, UTF-32 is also variable width because a written character (called an "extended grapheme cluster" in the Unicode Standard) often consists of multiple code points, even if UTF-32 represents each code point with a fixed-size code unit. Or at least that's the impression I get from the UTF-8 Everywhere manifesto [utf8everywhere.org].
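The distinction drawn in the two comments above (bytes vs. code points vs. written characters) can be checked in a few lines. This is a minimal Python sketch, not anything taken from grep or awk; the sample string is an illustrative choice, and `unicodedata.normalize` is used only to show that this particular pair of code points happens to have a single precomposed form.

```python
import unicodedata

# 'e' followed by U+0301 COMBINING ACUTE ACCENT:
# one written character, but two code points.
text = "e\u0301"

code_points = len(text)                       # Python strings count code points
utf8_bytes = len(text.encode("utf-8"))        # variable width: 1 byte + 2 bytes
utf32_bytes = len(text.encode("utf-32-le"))   # fixed 4 bytes per code point (-le: no BOM)

print(code_points)   # 2
print(utf8_bytes)    # 3
print(utf32_bytes)   # 8

# NFC normalization folds this pair into U+00E9, but many grapheme
# clusters have no precomposed form, so this is not a general fix.
composed = unicodedata.normalize("NFC", text)
print(len(composed))  # 1
```

So UTF-32 gives a fixed stride per code point, but a fixed stride per written character would still need grapheme-cluster segmentation on top.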

          • Grep on Unicode or not is always the same speed, as bottom line it uses BYTES; it does not matter if 2, 4, or 8 bytes are a Unicode char or a single ASCII char.

        • Theyâ(TM)ve saved lots of ££££s (thatâ(TM)s GBP sterling) by not fixing it

    • we'll finally have unicode support here.

      Unicode support has been in Perl for a long while. I suspect the limitation in /. is that the current owners don't care to spend the money on the necessary dev work. Slashdot at one point open sourced its engine as Slashcode, but I don't see any evidence the public project has been kept up to date or is leveraged much, if at all, any more?

      • by ls671 ( 1122017 )

        Well, the blue site has implemented Unicode in Slashcode, maybe simply ask them to do the work for some money then. I still wonder about exploits related to Unicode though; many must have been fixed but I suspect there might still be some lurking around. Does anybody know more than I do about the status on this?

        • I don't know about any blue site, but the people at the red site got Unicode into Slashcode.

        • Well, the blue site has implemented Unicode in Slashcode, maybe simply ask them to do the work for some money then. I still wonder about exploits related to Unicode though; many must have been fixed but I suspect there might still be some lurking around. Does anybody know more than I do about the status on this?

          I think I am out of the loop here. What is this blue site you are talking about?

      • How much Unicode support do you actually need to print "awk bailing out"? Come to think of it, why is the binary so large when that's all it does?
    • Too many people here like the fact that being so backwards as to not use proper literary quotes causes issues with the default iOS text behaviours. You can't take people's "smug" away or they'll resent it.

    • by tlhIngan ( 30335 ) <[ten.frow] [ta] [todhsals]> on Tuesday August 23, 2022 @05:16PM (#62815569)

      Internally slashcode supports Unicode, and has for decades.

      The problem is Unicode is problematic to support - you cannot just blindly treat all text someone types as innocent - there are plenty of codepoints that will cause you problems if you do not handle them.

      So the site has a Unicode whitelist of supported characters - basically the printable ASCII set, and as a protection against hacks, it also strips off the high bit.

      You can tell when a site owner naively implements Unicode support because their site is promptly made unreadable as everyone puts up a comment containing overdecorated characters, RTL control codes and other codepoints that basically turn the site into a complete mess.

      Handling Unicode is hard. It's caused more than its fair share of problems through the years, including several notable Android and iOS crashes.

      Anyhow, Brian Kernighan announced Unicode support for AWK during a Computerphile episode as well

      https://www.youtube.com/watch?... [youtube.com]
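The allow-list approach described in the comment above can be sketched as follows. This is a hypothetical Python illustration only: Slashcode is written in Perl and its actual filter is not shown in this thread, so the `ALLOWED` set and the `sanitize` function are assumptions. The sketch drops disallowed characters rather than masking the high bit.

```python
import string

# Hypothetical allow-list: printable ASCII (letters, digits, punctuation),
# plus space and newline. Everything else is rejected.
ALLOWED = set(string.ascii_letters + string.digits + string.punctuation + " \n")

def sanitize(comment: str) -> str:
    """Drop every character that is not on the allow-list."""
    return "".join(ch for ch in comment if ch in ALLOWED)

# U+202E is RIGHT-TO-LEFT OVERRIDE, U+0301 is a combining accent:
# the kind of code points the comment says can wreck a page.
sample = "safe text \u202e\u0301 more"
cleaned = sanitize(sample)
print(repr(cleaned))  # 'safe text  more'
```

The trade-off is the one the thread complains about: a filter this strict also removes curly quotes, pound signs and every non-Latin script.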

      • by AmiMoJo ( 196126 )

        This is one of the two major flaws in Unicode, the other being botched Chinese/Japanese/Korean support.

        Unicode should have been designed to make text processing easy, and defined any metadata needed to avoid the kinds of problems that Unicode support brings. Instead it's left to every developer and most can't handle it, so you get a broken and/or limited implementation.

      • So the site has a Unicode whitelist of supported characters - basically the printable ASCII set, and as a protection against hacks, it also strips off the high bit.

        So knowing very little about unicode but putting my PHB hat on - does this mean that all that needs to happen is some junior developer identifies the top 80% of unicode characters used in comments in the past year (looking at comments rated 4 or 5 to avoid the spam), do a quick manual review and then add them to the whitelist?

        That sounds like a j

  • ...so, never? (Score:5, Interesting)

    by timeOday ( 582209 ) on Tuesday August 23, 2022 @02:09PM (#62814913)
    I hope he doesn't wait until after mastering git. Mostly you just try to get through it, like hacking through a jungle with a machete.
    • Re:...so, never? (Score:5, Insightful)

      by 93 Escort Wagon ( 326346 ) on Tuesday August 23, 2022 @02:13PM (#62814925)

      The more I use git, the less I feel I have a decent grasp of it.

      • by davidwr ( 791652 )

        The more I use git, the less I feel I have a decent grasp of it.

        The more you know, the more you know you don't know.

      • Re:...so, never? (Score:5, Insightful)

        by computer_tot ( 5285731 ) on Tuesday August 23, 2022 @02:34PM (#62815003)
        Same here. git must be one of the weirdest, least intuitive, hardest to master development tools I've ever used. Just when I think I've got a handle on it, it does something weird that borks my tree or refuses to merge a commit and I need to untangle it. Most days I'd prefer to be using just about anything else - cvs, subversion, emailed patches and filesystem snapshots.
        • Re:...so, never? (Score:5, Informative)

          by UnknownSoldier ( 67820 ) on Tuesday August 23, 2022 @04:06PM (#62815331)

          git is easy once you understand the 5 different locations code can be in.

          This nice cheat sheet [ndpsoftware.com] shows the commands available for each location.

          * stash
          * workspace
          * index
          * local repository
          * remote repository

          • by raynet ( 51803 )

            You are forgetting the most important location, another folder without git (or with git).
            What I usually do is clone a repo and then rsync it to another folder where I will work on the code (often I do remove the .git folder and init a new blank git repo there). And periodically I take snapshots of said folder. Once I'm done with a feature, I will pull on the git folder, diffmerge all my changes there, add, commit, push. This way I never have to fight with any git problems like merges, conflicts etc. And also no

        • What most people don't understand is that git isn't source control.

          Git is a distributed merge system with history. It is absolutely perfect for Linus' workflow and an absolute piece of shit for everyone else.

          Once you understand it wasn't designed for what everyone is trying to use it for, the sooner the healing can begin.

      • Re:...so, never? (Score:4, Informative)

        by Darinbob ( 1142669 ) on Tuesday August 23, 2022 @03:05PM (#62815093)

        There are 3 parts of git really. The simple part that you use for daily stuff, which maps relatively straightforwardly to other source code control tools. Then the part that is intended to be very low level commands that no human is expected to use directly, but which can be used for higher level tools. Then there's the last third that is in the middle: more advanced concepts, for experts maybe, or that involve more git-only concepts, and that third often trips people up.

        E.g., "rebase": some people use it every day, but it really is an advanced concept that is difficult to really grok; it's like "merge" and some people even use it as a synonym for "merge", but it's really like replacing the bullets you normally use to shoot yourself in the foot with explosive rounds.

        I gave the tutorial on git to the team, using it every work day for 3 years, and I *still* feel like a noob at it. Though I certainly like it better than the stuff I used previously.

        • E.g., "rebase": some people use it every day, but it really is an advanced concept that is difficult to really grok; it's like "merge" and some people even use it as a synonym for "merge", but it's really like replacing the bullets you normally use to shoot yourself in the foot with explosive rounds.

          I've found that several things with git are much easier to understand if you keep in mind that commits are objects and a repository is a tree of commits. With that mental model, rebase is basically just taking a bunch of commits, detaching them from where they currently are on the tree, and reattaching them somewhere else on the tree. This is significantly different from a merge, which combines a branch back into the part of the tree that the branch started from. If you draw pictures of the trees, you can

          • Except as a grandparent pointed out, rebase already had a meaning in revision control and it *wasn't* that. Rebase was basically an update to get others' changes.
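The "commits are objects in a tree" mental model from the comment above can be made concrete with a toy example. This is a hedged Python sketch, not git's implementation: the `Commit` class, `history` and `rebase` are hypothetical names, real commits also carry content and hashes, and real history is a graph rather than a strict tree. It only shows the detach-and-reattach idea, with the replayed commits being new objects (marked with a trailing `'`).

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass(frozen=True)
class Commit:
    """Toy commit: a label plus a link to its parent (None for the root)."""
    label: str
    parent: Optional["Commit"] = None

def history(tip: Commit) -> List[str]:
    """Walk parent links from the tip and return labels, oldest first."""
    labels = []
    node: Optional[Commit] = tip
    while node is not None:
        labels.append(node.label)
        node = node.parent
    return labels[::-1]

def rebase(tip: Commit, old_base: Commit, new_base: Commit) -> Commit:
    """Replay the commits after old_base on top of new_base as new objects."""
    chain = []
    node: Optional[Commit] = tip
    while node is not None and node is not old_base:
        chain.append(node)
        node = node.parent
    new_tip = new_base
    for commit in reversed(chain):  # oldest first
        new_tip = Commit(commit.label + "'", new_tip)
    return new_tip

a = Commit("A")
b = Commit("B", a)
c = Commit("C", b)   # main:    A - B - C
x = Commit("X", b)
y = Commit("Y", x)   # feature: A - B - X - Y

rebased = rebase(y, old_base=b, new_base=c)
print(history(rebased))  # ['A', 'B', 'C', "X'", "Y'"]
print(history(y))        # ['A', 'B', 'X', 'Y']  (the originals are untouched)
```

A merge, by contrast, would add one new commit with two parents and leave X and Y where they are.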
    • Mostly you just try to get through it, like hacking through a jungle with a machete.

      Before I do anything with git, I back up my local source tree.

      That says enough about it.

  • by Khopesh ( 112447 ) on Tuesday August 23, 2022 @02:31PM (#62814991) Homepage Journal

    AWK doesn't implement curly-bracket quantifiers in its regular expressions, like .{4} or .{2,5}. These aren't part of every POSIX Extended Regular Expression specification, so they aren't deemed "missing", but they are specified in the latest POSIX ERE spec [opengroup.org]. I'd prioritize parity with GNU grep's ERE implementation above other features. UTF-8 support and a proper CSV implementation (to support values with quoted commas) sound nice too.

    (Warning, I use mawk [invisible-island.net], not The One True AWK [github.com], aka "nawk", which is what Kernighan is discussing.)

    The real question, however, is how quickly any changes to The One True AWK trickle into the POSIX spec and the more popular implementations (GNU awk, aka "gawk", BSD, mawk, and Busybox).
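For readers unfamiliar with the two features requested above, here is what interval quantifiers and quoted-comma CSV handling look like. This is a Python sketch for illustration only: Python's `re` module is not a POSIX ERE engine and says nothing about how any awk implements these, and the sample CSV line is made up.

```python
import csv
import io
import re

# Interval ("curly-bracket") quantifiers: .{4} is exactly four characters,
# .{2,5} is between two and five.
four_chars = re.fullmatch(r".{4}", "unix") is not None
too_short = re.fullmatch(r".{2,5}", "a") is not None
in_range = re.fullmatch(r".{2,5}", "awk") is not None
print(four_chars, too_short, in_range)  # True False True

# Why "proper CSV" matters: a quoted value may contain the separator.
line = 'name,"Kernighan, Brian",1977\n'
naive = line.strip().split(",")              # splits inside the quotes
parsed = next(csv.reader(io.StringIO(line)))  # honors the quotes
print(len(naive), len(parsed))  # 4 3
print(parsed[1])                # Kernighan, Brian
```

Splitting on a bare comma, which is roughly what setting awk's field separator to "," does, gives the four-field result.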

  • by TigerPlish ( 174064 ) on Tuesday August 23, 2022 @02:55PM (#62815057)

    This is the kind of article which makes me smile. Real News for Nerds.

    Doubly so because it's about a true luminary of our hobby / business / life. I may have "transitioned" to mainly a vm / windows environment.. but this.. this is Stuff That Matters.

    Triply so because this gent is showing all of us that yes, old dogs do learn new tricks.. just a bit more slowly than a young pup.

    Eds... take note... more of this, less of politicking and divisiveness. The only divisions we give a fuck here about are *nix vs. Everything Else, and vi vs. emacs.

    Puh-leese. Pretty please. It doesn't escape all our notice that since the arab takeover, this place went to the dogs.

    We have what now.. 2 dozen active posters? vs. hundreds not too long ago.

    I wonder why!

    If I had the dosh, I'd buy slashdot. But I don't, so now I just stfu except when I read patently retarded bullshit.

    • by godrik ( 1287354 )

      The only divisions we give a fuck here about are *nix vs. Everything Else, and vi vs. emacs.

      That's not true! We care about Euclidean division too! :)

    • by Areyoukiddingme ( 1289470 ) on Tuesday August 23, 2022 @03:36PM (#62815167)

      The only divisions we give a fuck here about are *nix vs. Everything Else, and vi vs. emacs.

      Why we insist on comparing a text editor to an OS I'm sure I don't know. But everybody knows emacs is the best OS. Now it just needs a good text editor...

      • I find that PowerShell is a really good way to edit text files. I would suggest you install it and .net on all your *nix boxes to try out.

        There, did I properly engage your heart?

      • Sorry, posting to undo moderation. Was going for funny, but hit overrated....
    • My very first freshman college CS class used a book called Software Tools by Kernighan & Plauger. It was a CS dept class taught on a CDC and I think the language we were using was RATFOR. No C on a CDC back then. Thought it was a great approach. Later I took EE classes, and the EE dept used PDPs. So of course that was UNIX and C, and of course we used the K&R C book. These guys were the architects of *nix. Heroes to me.
    • old dogs do learn new tricks.. just a bit more slowly than a young pup

      Maybe, maybe not. But the greybeard is far more likely to admit that learning this or that is hard.

      • Plenty of grey here. Still learnin'. Yeah it's slower, but.. it's still doable. Hardest part for me is to turn off the scatterbrain, which means caffeine and conscious effort.

        Violin is bloody ruthless, for one. I started almost exactly one year ago. I watch how fast the kids learn the same instrument, and I go x.x

        But I don't let it stop me.

        That's why this article caught my eye. The guy's 80.

        You want a real freak? Martha Argerich. She's 80, and she's better now than she's ever been. The older she g

        • by havana9 ( 101033 )
          Plenty of grey here too. I think the big difference between being a kid and being an adult is time. As a kid you don't have to work, run errands, spend time with your kids and so on. Another thing I noticed is that teachers and most educational materials are geared to kids and not to adults. Language courses and books have a different structure depending on whether they are meant for adults or kids.
    • by Saffaya ( 702234 )

      > The only divisions we give a fuck here about are *nix vs. Everything Else, and vi vs. emacs.

      I think we are also partisan to the systemd versus everything else debate.

    • by kbahey ( 102895 )

      It doesn't escape all our notice that since the arab takeover, this place went to the dogs.

      I fully agree that the quality of articles has declined a lot.

      But, what "arab takeover"?

  • OTBS?

    No thanks, I prefer Whitesmiths.

    But fortunately, because of astyle, everyone can edit any shared code using their preferred brace style, regardless of what it is.

    I honestly don't see why everyone doesn't use Whitesmiths, though. :)

    • by shoor ( 33382 )

      Heh, I see there's now an extensive wikipedia article on the different styles. I don't see why everybody doesn't use allman. And I like to include a comment after the last brace that alludes to the complementary opening brace, i.e.:

      while (a == b)
      { ....
      } /* end while a == b */

      But that's me.

      • Heh, I see there's now an extensive wikipedia article on the different styles. I don't see why everybody doesn't use allman.

        I second your endorsement of Allman style [wikipedia.org].

        If Slashdot hasn't done a poll on this topic, it would be a fun one to do.

        • There's a lot of terrible styles out there, and the Linux kernel one is just about the worst.
          • There's a lot of terrible styles out there, and the Linux kernel one is just about the worst.

            I had a look at it, and I'll say one thing. I really don't like the look of a closing brace followed by an opening brace on the same line, like "} else {".

            • by shoor ( 33382 )

              Having a brace style that goes "} else {" really limits the ability to include a comment also. I gave an example in my previous post,
              but with else statements I would have comments something like:

              if (temp >= setpoint())
              {
                  temp = lowertemp();
              } else /* temp below set point */
              {
                  checkotherstuff();
              } /* end else temp at or above setpoint */

              • aargh, I'll jump on this before anybody else does,

                That last comment should be
                } /* end else temp below setpoint */

                BTW, being an old command line type, I could appreciate awk. I never really needed to do that much stuff of the kind that awk is good for, so I never quite got it in muscle memory. I was more of a sed guy because of the similarities to vi.

  • If I have to slice and dice output I will use Perl which is the Platonic Ideal of all those clumsy arcane tools.

    • While you can do almost anything with Perl, the elegance of piping awk and sed won me over. I had something that generated about 1 million data points per day, and essentially needed a quick way to consolidate that down to a single metric. It was easy and painless to learn enough to do it with AWK/SED, and wicked fast. Perl had been my go-to, but it was much more clumsy for this task.

    • I tried Perl several times but hated it. It relies too much on little tricks and exceptions, which makes it miserable to read and modify. I still use AWK, though for some things I use Tcl or Python. And Lua has a wonderful regexp library.
    • Awk is excellent for columnar data from arbitrary tools:


      dir | awk '{print $3}' # extract the third column of values

      0
      4
      68
      4
      4
      4
      4
      40
      4
      4
      4
      796
      4
      4
      4
      16
      12

      I know it can do more, that's enough to keep it in my toolbox.
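For comparison, the same third-column extraction can be written out by hand. This is a minimal Python sketch of what the one-liner above does, assuming awk's default whitespace field splitting; `third_column` and the sample lines are hypothetical.

```python
def third_column(lines):
    """Yield the third whitespace-separated field of each line.

    Mirrors awk '{print $3}' with the default field separator:
    runs of blanks separate fields, and a line with fewer than
    three fields yields an empty string.
    """
    for line in lines:
        fields = line.split()
        yield fields[2] if len(fields) >= 3 else ""

# Made-up input; any columnar tool output would do.
sample = ["a b 4 d", "x   y\t796", "short line"]
result = list(third_column(sample))
print(result)  # ['4', '796', '']
```

The awk version is one short expression, which is the commenter's point about keeping it in the toolbox.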

  • With git, I know what I want to do but most of the time fail to get it done. At least now I feel less alone, I proudly suffer the same issues Brian Kernighan does.
  • Most languages I can essentially not touch or think about for years and come back and generally pick right back up where I left off. Not so with Perl. Any time I ever have to write any code in Perl, I have to find a tutorial and completely relearn the language from the ground up just to understand any code I had previously written let alone begin writing new code in Perl.

    Same thing's true with git. If it is anything outside of basic git add .; git commit -m "Message"; git push origin master/main, then I'

  • Ages ago -- early to mid 80s I think? -- I almost bought a ref book on GAWK, printed by MIT Press, purely for the cover art. It was a cartoonish image of a gnu, gawking at something. Head tilted to side, goggle-eyed. No amount of googling or d'duckgoing has been able to find that image. Surely it must exist somewhere still???
  • Interestingly enough, I learned all I know about Awk from the O'Reilly book "Sed and Awk" which Mr. Robbins co-wrote along with Dale Dougherty. I don't even know if O'Reilly still puts out new editions of it. Sadly, I don't use Awk much anymore, one day I might do some more stuff with Awk and vi. I tell ya though, Brian Kernighan is a genius, having co-created Unix and Awk, and creating RATFOR and the first "Hello world" program. And we can't forget the thing he's most known for: The C Programming Language,
  • I was always ashamed to admit I didn't understand git. Now not so much anymore.
    Thanks, Brian, for being relevant still.
