The British Library has launched a web archive designed to preserve pages from UK web domains, much as the library preserves a physical archive of British books and other publications. The system – which includes the open source Hadoop software – has been built by IBM.
The archive, to be launched this evening, will include special sites gathering together web material referring to specific subjects, including the Credit Crunch and Antony Gormley’s Trafalgar Square “Fourth Plinth” project from summer 2009. A site covering this year’s general election is already planned.
“Fifteen petabytes of data are created daily,” said David Boloker, IBM’s chief technology officer for emerging Internet technologies – and much of this data is never saved or stored, he said. Ten percent of UK websites disappear within six months, and the average life expectancy of data online is around 44 to 75 days, the BL project claims.
The project follows at least six years of work by the British Library – mostly on the practicality and legality (under copyright law) of copying UK web data.
“Since 2004, the British Library has led the UK Web Archive in its mission to archive a record of the major cultural and social issues being discussed online,” said British Library chief executive, Dame Lynne Brindley. “Throughout the project the Library has worked directly with copyright holders to capture and preserve over 6,000 carefully selected websites, helping to avoid the creation of a ‘digital black hole’ in the nation’s memory.”
The library has been working to extend the Legal Deposit requirement, under which publishers must give a copy of each issue to the library, so that it also covers online material. “Limited by the existing legal position, at the current rate it will be feasible to collect just one percent of all free UK websites by 2011. We hope the current DCMS consultation will enact the 2003 Legal Deposit Libraries Act and extend the provision of legal deposit through regulation to cover freely available UK websites, providing regular snapshots of the free UK web domain for the benefit of future research.”
The UK web domain currently houses eight million sites and is rapidly expanding, providing a continuously updated archive of social and cultural issues in Britain, the Library said. “Despite common misperceptions, material that is freely available on the web is still subject to copyright and cannot be archived without permission,” said Dame Lynne, “a time-consuming, expensive, and often impossible task.”
The archive uses BigSheets, a mashup project IBM has built, based on the open source Hadoop project from Apache. Hadoop is also in use by public cloud services such as Amazon’s, and distributed by Yahoo!
To ensure that their website content is archived for the future, organisations can automatically save daily screenshots of all their web pages, which are then kept for compliance, legal or general-interest purposes.
Cloud Testing, a UK company, has just launched its service Website-Archive, which is available at http://www.website-archive.com/