
 



Google Businesses The Internet Hardware

The Google Search Server 178

An anonymous reader submitted a reasonably in-depth review of the Google search appliance. The guys from anandtech put it through it's paces, and included a variety of pictures and comments on one of those Google products most of us will probably never play with.
This discussion has been archived. No new comments can be posted.

  • by DeadSea ( 69598 ) * on Tuesday September 06, 2005 @11:12AM (#13489948) Homepage Journal
    The Mini considers any unique URL string to be a unique document, which makes sense (but is a bit surprising the first time that you run an index). After four hours of indexing, the Mini had managed to reach its document limit and we had to improvise.
    Anybody who doesn't know that search engines consider each URL to contain a unique document doesn't know much about getting their site properly represented in search engines.

    Their solution was to create a list of urls for the appliance to crawl. If they had to do that for the search appliance, there is no way that googlebot, msnbot, or yahoo slurp is going to be able to properly index their site.

    Your publicly accessible URLs need to be managed and canonicalized through judicious use of robots.txt, 302 redirects, site-wide linking, and just plain thinking out the layout of your site.
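
To make the canonicalization point concrete, here is a minimal sketch of collapsing trivially different URL strings into one form before linking to them or handing them to a crawler. The `IGNORED_PARAMS` set and the `example.com` URLs are assumptions for illustration only; which parameters are safe to drop depends entirely on your own site.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical query parameters that do not change what the page shows.
IGNORED_PARAMS = {"sessionid", "sid", "sort", "ref"}


def canonicalize(url: str) -> str:
    """Collapse trivially different URL strings into one canonical form."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Keep an explicit port only when it is not the default for the scheme.
    default_port = {"http": 80, "https": 443}.get(scheme)
    if parts.port and parts.port != default_port:
        host = f"{host}:{parts.port}"
    path = parts.path or "/"
    # Drop ignored parameters and sort the rest so their order is irrelevant.
    query = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in IGNORED_PARAMS
    )
    # Fragments are never sent to the server, so they are dropped as well.
    return urlunsplit((scheme, host, path, urlencode(query), ""))


if __name__ == "__main__":
    print(canonicalize("HTTP://Example.COM:80/a?b=2&sid=xyz&a=1#top"))
```

Every variant of a page that maps to the same canonical string is a candidate for a redirect to that string, which keeps a crawler from counting it as a separate document.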

  • Was this a review? (Score:2, Informative)

    by defkkon ( 712076 ) on Tuesday September 06, 2005 @11:12AM (#13489949)
    Was this a hardware review, or was this an instruction manual?

    I gotta say, I was looking for benchmarks, usability scores, maybe some test scenarios. Even better, compare this to other products available out there.

    It looked promising at the start, but when you get to the last page it leaves you wondering if they forgot the hyperlinks for the rest of the article!!

  • Re:Neat insides (Score:5, Informative)

    by b0r1s ( 170449 ) on Tuesday September 06, 2005 @11:13AM (#13489953) Homepage
    These are neat little boxes - we've managed 2 (the yellow appliance, and the blue mini appliance), and the performance of both was pretty nice.

    The tools google provides (very easy binary updates, strong web control panel, for example) turn the relatively common task into a dead-simple, point-and-click configuration.

    They even provide a decent interface for skinning the search pages, and while it's not perfect, it's certainly adequate for even the best looking sites on the internet.
  • It's "its"! (Score:5, Informative)

    by dtmos ( 447842 ) on Tuesday September 06, 2005 @11:21AM (#13490029)
    The guys from anandtech put it through
    it's paces

    It's really easy: It's "his", "hers", and "its". Even a flower [angryflower.com] knows!

    --cycling through grammar Nazi mode. Please wait.

  • Re:where's the raid? (Score:5, Informative)

    by slim ( 1652 ) <john.hartnup@net> on Tuesday September 06, 2005 @11:32AM (#13490109) Homepage
    I guess if you want RAID, you pay more than $3,000.

    What you're really buying here is closed-source software, wrapped in the hardware that turns it into an "appliance". Assume $2,000 of that $3,000 pays for the software.

    By specifying the hardware in this way, and by keeping the BIOS and root passwords to themselves, Google greatly simplify their support role.

    This is common practice: an IBM HMC (Hardware Management Console) is a 1U PC with a custom Linux distribution and the management software preinstalled. You don't get the root password; you just use the software as delivered.
  • by Anonymous Coward on Tuesday September 06, 2005 @11:34AM (#13490133)
    I can search the 63,000 online documents with http://www.google.com/search?q=site:www.anandtech.com
  • Re:It's "its"! (Score:4, Informative)

    by Traa ( 158207 ) on Tuesday September 06, 2005 @11:36AM (#13490145) Homepage Journal
    Use "it's" when you can replace it with "it is"

    Well, that is what someone told me anyway. English is not my primary language, if the above is not correct then please don't shoot me.
  • Re:GPl compliance (Score:3, Informative)

    by Anonymous Coward on Tuesday September 06, 2005 @11:42AM (#13490201)
  • by Homicide ( 25337 ) on Tuesday September 06, 2005 @11:44AM (#13490207) Homepage
    I admin a full-blown Google Search Appliance, the Mini's big brother.

    If you want the specs:
    Dual Xeon 2.6GHz
    12GB RAM
    4 250GB HD's in RAID(something) with a hot-swap spare.

    Never tried taking off the cover though, since we want to keep the warranty.

    All of the money you pay is a license for the software on the box, the system itself is effectively free, so once the 2 year warranty expires, you've effectively got a nice powerful linux box for free. You can keep running the software, but without any support.

    As for performance, this thing works great. We have about 250,000 pages that it can index, both public and private (and it can do searches cleverly, checking username/password to see if you should have access to certain results), and we've had nothing but positive responses from our users. The results come up quickly, they're the results people want, and the results that management think should be at the top are at the top.
  • by msblack ( 191749 ) on Tuesday September 06, 2005 @11:53AM (#13490277)
    We evaluated one of those yellow Google search appliances (GSA) and experienced very mixed results. It is very easy to set up the appliance and launch an initial scan of a website.

    The GSA will blindly search all web servers in your domain. When setting up the GSA, you give it an initial page from which to start crawling and baseline domains. For example:

    Initial page: http://www.slashdot.org/ [slashdot.org]
    Domain(s): .slashdot.org,slashdot.org

    The leading dot on the first domain entry says to search all hosts in the domain.
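
For anyone unsure how the leading-dot rule plays out, this is a small sketch of the matching as described here. It is not the GSA's actual code; `host_in_scope` is a hypothetical helper, and it assumes an entry without a leading dot matches only that exact host.

```python
def host_in_scope(host: str, patterns: list) -> bool:
    """Return True if host falls inside any of the configured domain entries.

    Assumed semantics: ".slashdot.org" matches every host in the domain,
    while "slashdot.org" matches only that exact host name.
    """
    host = host.lower().rstrip(".")
    for pattern in patterns:
        pattern = pattern.lower()
        if pattern.startswith("."):
            # Leading dot: any host that ends with ".domain" is in scope.
            if host.endswith(pattern):
                return True
        elif host == pattern:
            return True
    return False


if __name__ == "__main__":
    entries = [".slashdot.org", "slashdot.org"]
    for name in ("www.slashdot.org", "slashdot.org", "notslashdot.org"):
        print(name, host_in_scope(name, entries))
```

Under that assumption both entries are needed: the dotted one covers the hosts inside the domain and the bare one covers the domain name itself.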

    Problem: GSA does not provide very good status of where or what it is searching. It only has a dashboard light to say it is crawling. No details.

    Problem: We found that the GSA would get caught in an endless loop if it encountered a user website controlled by a database. It would endlessly follow the next and previous links to find every database entry.

    Our university library subscribes to a number of electronic databases, such as EBSCO PsychINFO, etc. The GSA indexed every possible look-up.

    Our eval license was limited to 1.5 million pages. Some of these databases contain hundreds of thousands of pages. Solution: Those setting up their own web server must employ proper robots.txt files or risk having their entire server blocked from indexing.
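
As a rough illustration of what a proper robots.txt buys you, the sketch below uses Python's standard `urllib.robotparser` to check URLs against a rule set. The rules, the `/db/` path, the `examplebot` agent name, and the `example.edu` host are all made up for the example; a real file has to list your own database and script paths.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that keeps crawlers out of database-driven pages.
ROBOTS_TXT = """\
User-agent: *
Disallow: /db/
Disallow: /cgi-bin/
"""


def make_checker(robots_txt: str) -> RobotFileParser:
    """Build a parser from robots.txt text instead of fetching it over HTTP."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


if __name__ == "__main__":
    checker = make_checker(ROBOTS_TXT)
    for url in (
        "http://www.example.edu/db/lookup?id=42",
        "http://www.example.edu/about.html",
    ):
        print(url, checker.can_fetch("examplebot", url))
```

A well-behaved crawler performs the same check before each request, so one `Disallow` line on the database front end is enough to stop the endless next/previous walk.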

  • by Augusto ( 12068 ) on Tuesday September 06, 2005 @11:56AM (#13490293) Homepage
    The problem is not Google, it's the way your app is designed!

    Universal Resource Identifiers -- Axioms of Web Architecture : Identity, State and GET [w3.org]

    In HTTP, GET must not have side effects.

    In HTTP, anything which does not have side-effects should use GET

    If somebody visited your site with a pre-fetching tool like the google web accelerator, you will also find the "delete" button being checked automatically like this. Change those deletes to use POST instead.
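
A minimal sketch of that rule, assuming a hypothetical app with `/items/<id>` pages and an `/items/<id>/delete` action. The `handle` function and the URL layout are invented for illustration and are not taken from any particular framework.

```python
from http import HTTPStatus


def handle(method: str, path: str, items: dict) -> tuple:
    """Dispatch a request against a tiny in-memory item store.

    Reads are allowed via GET or HEAD; the destructive action needs POST.
    """
    parts = path.strip("/").split("/")
    if len(parts) < 2 or parts[0] != "items" or parts[1] not in items:
        return HTTPStatus.NOT_FOUND, ""
    item_id = parts[1]
    if len(parts) == 2:
        if method in ("GET", "HEAD"):
            return HTTPStatus.OK, items[item_id]
        return HTTPStatus.METHOD_NOT_ALLOWED, ""
    if len(parts) == 3 and parts[2] == "delete":
        if method != "POST":
            # A crawler or prefetcher following a link lands here; nothing changes.
            return HTTPStatus.METHOD_NOT_ALLOWED, ""
        del items[item_id]
        return HTTPStatus.SEE_OTHER, ""
    return HTTPStatus.NOT_FOUND, ""


if __name__ == "__main__":
    store = {"1": "first item"}
    print(handle("GET", "/items/1/delete", store), store)
    print(handle("POST", "/items/1/delete", store), store)
```

Crawlers and prefetchers only issue GET (and sometimes HEAD), so once the delete action refuses anything but POST they cannot trigger it by following links.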
  • by Homicide ( 25337 ) on Tuesday September 06, 2005 @11:59AM (#13490315) Homepage
    It submits a HTTP HEAD request for the URL to the server the page is on, with the username and password supplied, so the server at the other end decides if you should be able to see the search results, thus saving you from having to faff around telling the google box who can get to what pages.
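
Roughly, that check could look like the sketch below. It assumes HTTP Basic authentication, which is only a guess at what the appliance sends; `build_head_request` and `user_may_view` are hypothetical names, and the URL is a placeholder.

```python
import base64
import urllib.error
import urllib.request


def build_head_request(url: str, username: str, password: str) -> urllib.request.Request:
    """Build a HEAD request carrying the searcher's credentials (Basic auth assumed)."""
    raw = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return urllib.request.Request(
        url,
        method="HEAD",
        headers={"Authorization": f"Basic {token}"},
    )


def user_may_view(url: str, username: str, password: str, timeout: float = 5.0) -> bool:
    """Ask the origin server whether this user is allowed to see the page."""
    request = build_head_request(url, username, password)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except urllib.error.HTTPError as err:
        # 401/403 mean the server refused this user; hide the result.
        if err.code in (401, 403):
            return False
        raise


if __name__ == "__main__":
    req = build_head_request("http://intranet.example.com/private/report.html", "alice", "secret")
    print(req.get_method(), req.get_header("Authorization"))
```

HEAD keeps the check cheap because the server returns only the status and headers, not the page body, and the access rules stay in one place on the origin server.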
  • by Anonymous Coward on Tuesday September 06, 2005 @12:28PM (#13490603)
    ... which flows right into this statement:
    A word to the wise: don't let the Mini crawl your entire site without keeping a close eye on it.

    The same could be said of any search engine, or any automated process for that matter. We use ht://Dig and the issues are the same, except that ht://Dig can run locally on the server. That saves bandwidth and speeds up indexing: static files are indexed locally with their URLs rewritten, and dynamic pages go through Apache. It's also free, and you aren't limited to 100,000 documents. It supports the same feature set, minus the Google GUI.

    Of course, it does have a steeper learning curve... you actually need to understand how search, URL filters, regex, synonyms, etc. work.

    I'd provide screenies, but most people glaze over when confronted with terminal output ; ) A shell just isn't as hip as an html gui. What else can I say?

    L8,
    AC

  • by macshome ( 818789 ) on Tuesday September 06, 2005 @01:12PM (#13491007) Homepage
    Why did this get marked troll?

    According to Google, they do use pigeons [google.com].
  • Re:It's "its"! (Score:4, Informative)

    by radishes ( 904663 ) on Tuesday September 06, 2005 @01:29PM (#13491184)

    and use its' when it's possesive

    john's coming to get johns' hat

    Don't listen to this guy. He has lied to you twice. 1) Its' is never valid. 2) The example with John is just so wrong it hurts. "John is coming to get John's hat." You use 's for possessive; s' is for possessive plural, like this: "Slashdotters tend to live in their parents' basement."

  • by Anonymous Coward on Tuesday September 06, 2005 @02:54PM (#13492014)
    Our experience in evaluating the Google machine for a large (100+ million hits/day) site was less than positive. We'd have needed over 40 of their regular boxes to supply the search results, and there is no built-in cluster management. Since there is no access to the filesystem, we would have needed to write our own tool to interact with their web-based GUI, and if they changed bits with an automatic software update, too bad for us :-(

    Needless to say, we declined. Results and response times were pretty good though.
  • by LordBlackadder ( 913035 ) on Wednesday September 07, 2005 @12:18AM (#13496796)
    Google has many production quality problems with its distributor. I had to return 2 units before I received a functioning unit the 3rd time.

    I benchmarked the functioning Google Mini the other day. I haven't published detailed results yet, but I can tell you that the performance was very poor considering the performance expectations for a brand like Google. I benchmarked the Mini at an average of only 3 transactions per second: max of 7 TPS, min of 1 TPS. Load balancing across 2 boxes sped up transactions by only ~30%.

    While I think the appliance is very capable, neither the Google Mini nor the larger yellow appliance is suitable for wide enterprise deployment. My company of 100,000+ users certainly can't use a system at this performance. I don't think my workgroup of 20+ people will be able to use it productively. We bought the box, but I think it will stay in the closet for limited uses. It has potential for h4xng with processor/mem upgrades - maybe even dd to new hardware. But until Google concentrates on appliance performance, their "Google Enterprise" initiative won't be taken seriously by the target market.
  • by guacamolefoo ( 577448 ) on Wednesday September 07, 2005 @10:38AM (#13499607) Homepage Journal
    I seriously considered getting a Google Mini for my law office. The desktop search stuff wasn't really doing it for us, and we have boatloads of work that we reuse on a regular basis -- pleadings/contracts/settlement agreements, etc. are sort of like code in that respect -- we always want to reuse our knowledge rather than reinventing the wheel. My concern was that the regular Google appliance was too expensive. The mini seemed reasonable, but I still was resisting the idea of paying that much for search.

    In any case, I had searched high and low for a decent search function when I happened upon swish-e. I am exceptionally pleased with it. It can be found at swish-e.org [swish-e.org].

    I am not an uber geek, but I was capable of spending an afternoon monkeying with it to install it, set up regular indexing as a cron job, get it to properly read and index OpenOffice documents, and to launch them from the browser. This involved some frightening security settings, but I have a small enough office (three people) that I'm not too torqued about this. The wide open settings I used were not swish-e's fault, as near as I could tell. Rather, they resulted from my laziness -- "It works well enough now, and the likelihood of malicious use is pretty low, so fuck it".

    Obviously, it could be set up a bit more cleanly on my end, but I am really, really happy with it apart from that. Currently, it runs on a used SCSI-RAIDed IBM Netfinity box that I picked up for a little under $500.

    The time and money I spent on the hardware plus getting it running has paid immense dividends. I have benefitted in two primary ways:

    First: my office minions use the network for storage and do not store anything locally. This means that everything is indexed (and can be found!) and because they like the search so much, they also (unwittingly, perhaps) give me the peace of mind knowing that our data also gets the other benefits of being on the network (everything is backed up automatically/regularly, etc.).

    They like being able to find stuff, so the search has really encouraged saving stuff on the network. I could mandate this in other ways, but I'd rather have them drinking my Kool Aid than simply imposing the idea.

    Second: My minions and I have saved tons of time using the search feature. Any good search does that. The additional bonus is that I no longer have to worry about the next version of Google Desktop or Copernic or installing it on various machines, blah, blah, blah. It's all centrally saved and configured. Administration is essentially zero since I am getting good search results on all the document types that I need - some old MS Office leftovers, Open Office, and PDF.

    I don't see needing to change this in any significant way for at least as long as I keep the hardware. I think that the next time I'll need to touch it will be when the index outgrows the box serving the searches.

    The box I'm running has dual 1.something gig pentiums with a gig of RAM. The drives are the weak link, with only 9.1 GB of space available for storage of OS, index, etc. The box also has redundant power supplies, redundant power supplies , redundant ethernet connections (100MB), and redundant ethernet connections (100MB).

    The front end to the search is just a standard, "came with it" CGI script (swish.cgi). It works just fine. It gets called up as a webpage locally, and it spits out results.

    On a final note, we are pretty aggressive in enforcing standardized file naming conventions. The naming conventions typically include the client name, the matter, a date, the type of document, and the subject of the document. Swish-e has document path, title, title and body searches off the interface we use, and you'll usually find exactly what you're looking for if you're reasonably specific.

    On a final note, swish-e has been unsuccessful when I have used the following search terms "nubile blonde woman" and "willing to get with me". In that respect, swish-e has been an outright failure, though it is conceivable that the fault lies with operator error.

    GF.
