Library of Congress Offers Update On Huge Twitter Archive Project 88
Nerval's Lobster writes "Back in April 2010, the Library of Congress agreed to archive four years' worth of public Tweets. Even by the standards of the nation's most famous research library, the goal was an ambitious one. The librarians needed to build a sustainable system for receiving and preserving an enormous number of Tweets, then organize that dataset by date. At the time, Twitter also agreed to provide future public Tweets to the Library under the same terms, meaning any system would need the ability to scale up to epic size. The resulting archive is around 300 TB in size. But there's still a huge challenge: the Library needs to make that huge dataset accessible to researchers in a way they can actually use. Right now, even a single query of the 2006-2010 archive takes as many as 24 hours to execute, which limits researchers' ability to do work in a timely way."
Why? (Score:5, Insightful)
Why does the federal government need to archive the useless information Twitter calls tweets? Yet another huge waste of my money (being a taxpayer and all).
Re: (Score:1)
Because we desperately need to know that little Susie just ate some pizza and finished taking a shit 5 minutes ago.
Oblig... (Score:2)
So how big, in Libraries of Congresses, is the archive that they're adding from Twitter, to said Library of Congress?
Re: (Score:1)
So in hindsight (Score:2)
only 0.6 Libraries of Congress.
Re:Why? (Score:5, Insightful)
To paraphrase a quote by the Internet Archive chairman from some years back, "The average lifespan of a Web page today is 100 days. This is no way to run a culture."
Re:Why? (Score:5, Insightful)
To paraphrase a quote by the Internet Archive chairman from some years back, "The average lifespan of a Web page today is 100 days. This is no way to run a culture."
The average life of an inane conversation used to be maybe 15 minutes. I'm not sure the world is a better place for having extended that.
Re: (Score:2)
mod parent up.
Re:Why? or lifespan of conversations (Score:1)
The average life of an inane conversation used to be maybe 15 minutes. I'm not sure the world is a better place for having extended that.
In the old days of USENET, conversation threads used to run for weeks, sometimes months, actually.
Not minutes.
Of course, back then, we actually knew who everyone was, and could ping and finger them.
Re: (Score:2)
Teen angst and celebrity gossip are considered culture, but popular movies and music are not. American society at its finest!
Re:Why? (Score:5, Interesting)
Because academia is starved for data, and companies hoarding information limits what we can do with it. The Library of Congress is acting as an aggregate buyer for thousands of individual researchers, which is a huge cost savings.
Re: (Score:2)
My first reaction was "no, please, don't encourage the twits."
Re:Why? (Score:5, Insightful)
Because Twitter is a great model for the spread of ideas. If you study the spread of ideas, you can begin to understand it and use that understanding to affect it. That has enormous value.
Re: (Score:2)
"Because Twitter is a great model for the spread of ideas."
Indeed. We'll have a treasure trove of racist/bigot/whatever messages from 20-30 years ago, when they were young and dumb, for every candidate we are going to vote for.
Re: (Score:2)
300TB of storage can be built for less than $100k these days. Far from a "huge" waste of money. Though given the value of most Twitter posts, it's still probably a waste of $99,500.
Re: (Score:2)
Like I said, I question the value, but the cost just isn't that much. People seem to think 300TB is a big number to store or manage, but it's really not any more.
And honestly, I relish the day when all of the teens and twentysomethings get older and start running companies or running for political office and rather than guess at their *real* past lives we can just search for all of the idiotic/offensive/racist comments they made over the years...
Re: (Score:3)
Because tweets aren't useless - they're as much a part of society's communications as postcards, phone calls, etc. There's a lot of information there about the day-to-day interests and communication patterns of a lot of ordinary people.
For a historian or a sociologist, that archive is going to be a gold mine.
I wonder... (Score:1)
Is the limitation hardware or software? Where is the bottleneck?
Just give me a csv.
Re: (Score:2)
Is the limitation hardware or software? Where is the bottleneck?
Just give me a csv.
Probably a simple hashing routine would cut down on the size: 1 = LOL, 10 = ROFL, 11 = ROFLMAO, ...
Re: (Score:2)
Or just use some level of solid compression [wikipedia.org] and the problem solves itself.
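Just to illustrate the point above: short, repetitive tweet-like text compresses extremely well when batched into one stream, since the compressor can reuse its dictionary. A toy sketch with zlib (purely illustrative; the sample tweets are made up):

```python
import zlib

# Short, highly repetitive tweet-like strings compress well when
# concatenated into one stream, since zlib reuses its dictionary.
tweets = ["LOL that was great #sports", "ROFL same here #sports"] * 1000
blob = "\n".join(tweets).encode("utf-8")

compressed = zlib.compress(blob, level=9)
ratio = len(compressed) / len(blob)
print(f"raw={len(blob)} bytes, compressed={len(compressed)} bytes, ratio={ratio:.3f}")
```

On data this repetitive the ratio lands well under 10%; real tweet streams are less uniform, so actual savings would be smaller.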
Bottleneck.... (Score:2)
narrow it down? (Score:2)
provide a limited version of the database with only some information from the tweets, so there's less data to search through? (of course, keep the full data in case a search depends on it)
Re: (Score:1)
Re: (Score:2)
I meant providing all the tweets in a simpler form, as opposed to excluding some of the tweets entirely, but I suppose it would make sense to at least test on a small subset of tweets first.
Re: (Score:2)
Re: (Score:2)
with only some information from the tweets
That's a really good idea. Hell, that would probably make their whole project a couple of megabytes!
Re: (Score:2)
Done!
Why not? (Score:1)
Just buy batches of 300 of those 1TB flash drives in the article below and pass them out to the researchers as needed?
My goodness 24 hours? (Score:2)
Re: (Score:2)
Indexing on data sets of that size is itself a pretty big challenge. You don't want an index that takes years to build, and it doesn't do much good if it's so huge that it is itself super-slow to access.
There is some research [pdf] [helsinki.fi] on making compressed full-text indexes, but much of it is still research-level.
Re: (Score:1)
Deduplication ought to help too.
He who archives my tweets (Score:1)
Archives trash.
Really, why not record and archive random traffic sounds? Some day when everyone is flitting about in whisper quiet air cars they'll marvel at the cacophony of the present age. Gadzooks!
Re:He who archives my tweets (Score:5, Interesting)
Some of the most important historical knowledge comes from things that people at the time wouldn't consider important. Things like grocery lists can help determine the diets and agricultural abilities of a culture at the time.
For an example I just made up: In the future, the presence or lack of traffic reports could, alongside legal/budget records, help a historian verify the spread/development of roadways.
Twitter could be a huge source of topics and a wealth of information for historians in the future.
They may conclude that we were all idiots. This too, counts as useful information.
Re: (Score:2)
Re: (Score:2)
Why is parent modded to 0? Storing them on any number of cloud services would be a lot cheaper than building their own system. Amazon and Google already host public datasets [amazon.com] over 300TB for researchers. Hell, they could just agree to pay Twitter a service fee for data and keep offline tape backups. While we are at it, why not maintain a Torrent of each year?
Re: (Score:1)
They aren't. AC posts always start at 0. Welcome to Slashdot.
Re: (Score:2)
Oh, right. I would like to point out my old-skoolish 6-digit uid!
Down the Terlit (Score:1)
Re: (Score:2)
It's illegal to make a copy of any of those other things though.
Re: (Score:2)
Re: (Score:2)
It has its place.
Re: (Score:3)
Re: (Score:2)
Stuck in a loop here.... (Score:4, Funny)
So, just how many 'Libraries of Congress' are there in 300TB? ;-)
Does this mean that as the archives swell, the metric does also?
Where does this madness end?
Unit conversion (Score:2)
1 Library of Congress [wikipedia.org] ~ 10TB of data
Therefore, the database will be around 30 LoCs in size.
But, if we consider this database as part of the Library of Congress, we get a fixed-point problem...
Re: (Score:2)
The Library's Mission Statement is ... (Score:2)
It's not much of a step from there to archiving all the phone conversations of all Americans
Re: (Score:2)
Re: (Score:2)
No, I don't see how archiving Twits and tweets furthers this mission *at all.*
And in what way does that matter? After all, Congress, the President and the Supreme Court don't follow the Constitution, why should we expect any other bureaucracy to do what they're supposed to?
Re: (Score:2)
...wow. They actually do follow the constitution. But please, enlighten us with your Intro to Law class....
I'm pretty sure the constitution doesn't say that you can forget it if you rent a piece of land and station people on it.
and how can everyone get to bear arms if bears are a protected species..
seriously? (Score:2, Insightful)
300TB worth of tweets, which are basically very small text files? A single tweet that uses all available characters should only be 140 bytes. I just refuse to believe that there are 2+ trillion tweets out there to make up 280+TB, considering 1 billion tweets would be 140GB. (Unless I'm failing massively at math here, which is quite possible.)
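For what it's worth, the arithmetic above checks out for raw text alone; it's the per-tweet metadata that closes the gap:

```python
# Raw-text-only estimate: 140 bytes per tweet, no metadata.
bytes_per_tweet_text = 140

per_billion_gb = bytes_per_tweet_text * 1_000_000_000 / 1e9   # GB per billion tweets
tweets_needed = 300e12 / bytes_per_tweet_text                 # tweets to fill 300 TB

print(per_billion_gb, "GB per billion tweets;",
      f"{tweets_needed:.2e} tweets to reach 300 TB at 140 bytes each")
```

That works out to 140 GB per billion tweets and roughly 2.1 trillion tweets to fill 300 TB on text alone, so the discrepancy has to come from metadata and overhead, not the math.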
Re: (Score:2)
but if you dump the data associated with the tweet as well...
Re: (Score:1)
https://dev.twitter.com/docs/counting-characters [twitter.com]
Obviously no programmers on right now (Score:2)
Look, I don't know about you, but we process hundreds of TB of data when we process genomes, using this fancy stuff called "databases", "hash indexing", and fancy software that may be hard for you to find like Perl, C, and various scripting languages.
It's fairly simple coding. Just build an index hash from keywords (which are all preceded by #), add another index by words (ignoring all the bit.ly and other web links), add a third index by @ reference (aka user names, which are really just a 20 character par
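A toy version of the hashtag / @-mention indexing sketched above (illustrative Python, not anyone's actual pipeline; the regexes and sample tweets are made up for the example):

```python
import re
from collections import defaultdict

HASHTAG = re.compile(r"#(\w+)")
MENTION = re.compile(r"@(\w+)")

def index_tweets(tweets):
    """Return (hashtag -> tweet ids, username -> tweet ids) indexes."""
    by_tag, by_user = defaultdict(set), defaultdict(set)
    for tweet_id, text in enumerate(tweets):
        for tag in HASHTAG.findall(text):
            by_tag[tag.lower()].add(tweet_id)
        for user in MENTION.findall(text):
            by_user[user.lower()].add(tweet_id)
    return by_tag, by_user

tweets = ["@alice check this #superbowl clip", "#superbowl was wild"]
by_tag, by_user = index_tweets(tweets)
print(sorted(by_tag["superbowl"]), sorted(by_user["alice"]))  # [0, 1] [0]
```

The coding really is simple; the open question is whether a straightforward index like this stays usable at 300 TB without serious engineering behind it.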
Data Relevance (Score:2)
Percentage of Americans with Accounts:
Twitter: 13%
Facebook: 70%
So there is FAR less diversity and extremely poor-quality data. Why did they not archive public Facebook posts instead?
I see it as: Facebook hosts people who write articles, stories, poems, songs, music, pictures, etc. THAT is the point of the Library of Congress: documenting and preserving culture. Not trying to datamine the history behind "WAT R U DOIN FRI GRRL?",
Kwic! Someone set us up the index! (Score:2)
How many of those are RTs? (Score:1)
A substantial number of posts are literal duplicates by known spambots.
You could store those separately as well as the Retweets (RTs).
Then, think about what typically gets posted.
Most might be something like 520,000 variations on "Touchdown!" or "That's gotta hurt!" during sporting events, or "It's snowing!"
A lot of the rest are probably repeats of what someone just said on Comedy Network or during a TV program. They will all be at about the same time in a region and be substantially the same thing, with 5
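The store-once-with-a-count idea above is just deduplication. A toy sketch (invented sample data):

```python
from collections import Counter

def dedupe(tweets):
    """Store each distinct text once with a count, instead of N copies."""
    return Counter(t.strip().lower() for t in tweets)

tweets = ["Touchdown!", "touchdown!", "It's snowing!", "Touchdown!"]
counts = dedupe(tweets)
print(counts["touchdown!"])  # 3
```

Real dedup would keep per-copy metadata (who, when) alongside the shared text, but for retweets and spambot duplicates the text itself only needs to be stored once.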
300TB is about right (Score:2)
300TB is about right. Twitter says they have 400 million tweets per day. Figure about 500 bytes per message with text, and metadata (source, destination, timestamp, flags). 400,000,000 msgs/day * 365*4 days * 500 bytes = 292,000,000,000,000 bytes.
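The arithmetic in that estimate can be checked directly:

```python
msgs_per_day = 400_000_000      # Twitter's stated daily volume
days = 365 * 4                  # four years of tweets
bytes_per_msg = 500             # ~140 bytes of text plus metadata

total = msgs_per_day * days * bytes_per_msg
print(total / 1e12, "TB")       # 292.0 TB, in line with the ~300 TB archive
```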
Twitter offers a feed of 1 in 10,000 public tweets, so you can see how banal it is. I had a program monitoring that for a while, extracting links and evaluating them for spam. It's about as bad as you'd expect.
Re: (Score:1)
https://dev.twitter.com/docs/counting-characters
Plus, what kind of conspiracy theory is it that the Library of Congress would want to willfully deceive us about the size of a Twitter archive...
Re: (Score:1)
And how do you say it won't make a difference to the statistics? 140 characters, multiplied by a potential 2, maybe 3 bytes per character - that absolutely increases the size! They aren't magical letters.
When the data is available, feel free to run stats on tweet size; I'm sure it will be boring. The conspiracy theory was a joke. Point being, who cares how big it is. I'm pissed that they spent
Why? (Score:2)
Why? Why does it take so long?
They talk about the hardware and software not being up to scratch, but many other companies seem to be able to process huge amounts of data quickly. Google, for one, seems to do it.