British Library Launches UK Web Archive

The British Library has launched a web archive designed to preserve pages from UK web domains, much as the library preserves a physical archive of British books and other publications. The system – which includes the open source Hadoop software – has been built by IBM.

The archive, to be launched this evening, will include special sites gathering together web material referring to specific subjects, including the Credit Crunch and Antony Gormley’s Trafalgar Square “Fourth Plinth” project from summer 2009. A site covering this year’s general election is already planned.

“Fifteen petabytes of data is created daily,” said David Boloker, IBM’s chief technology officer for emerging Internet technologies – and much of this data is never saved or stored, he said. Ten percent of UK websites disappear within six months, and the average life expectancy of data online is around 44 to 75 days, the BL project claims.

The project follows at least six years of work by the British Library – mostly on the practicality and legality (under copyright law) of copying UK web data.

“Since 2004, the British Library has led the UK Web Archive in its mission to archive a record of the major cultural and social issues being discussed online,” said British Library chief executive, Dame Lynne Brindley. “Throughout the project the Library has worked directly with copyright holders to capture and preserve over 6,000 carefully selected websites, helping to avoid the creation of a ‘digital black hole’ in the nation’s memory.”

The library has been working to extend the Legal Deposit requirement, under which publishers must give the library a copy of each publication, so that it also covers online material. “Limited by the existing legal position, at the current rate it will be feasible to collect just one percent of all free UK websites by 2011. We hope the current DCMS consultation will enact the 2003 Legal Deposit Libraries Act and extend the provision of legal deposit through regulation to cover freely available UK websites, providing regular snapshots of the free UK web domain for the benefit of future research.”

The UK web domain currently houses eight million sites, and is rapidly expanding, providing a continuously updated archive of social and cultural issues in Britain, the Library said. “Despite common misperceptions, material that is freely available on the web is still subject to copyright and cannot be archived without permission,” said Dame Lynne, “a time-consuming, expensive, and often impossible task.”

The archive uses BigSheets, a mashup project IBM has built, based on the open source Hadoop project from Apache. Hadoop is also in use by public cloud services such as Amazon’s, and distributed by Yahoo!

The BigSheets prototype is in use within six organisations worldwide, said Boloker – the others include commercial organisations in fields such as pharmaceuticals. It is designed to handle unstructured data from Web-based repositories, which it “enriches” with what IBM describes as an “unstructured information management architecture”, supporting things like tag clouds.
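To give a rough idea of how a tag cloud can be derived from unstructured page text using the map-and-reduce pattern that Hadoop popularised, here is a minimal Python sketch. It is not IBM's BigSheets code and does not use Hadoop itself: the sample pages, the stopword list and the function names (`map_page`, `reduce_counts`, `tag_cloud`) are all assumptions made up for illustration.

```python
import re
from collections import Counter

# Hypothetical sample pages; these are NOT real UK Web Archive data.
PAGES = {
    "example.co.uk/a": "Election news: the election campaign and the economy",
    "example.co.uk/b": "Economy update: credit crunch hits the economy",
}

# A tiny illustrative stopword list (an assumption, not a standard one).
STOPWORDS = {"the", "and", "a", "of", "to", "in"}


def map_page(text):
    """Map step: emit a (term, 1) pair for every non-stopword term."""
    for term in re.findall(r"[a-z]+", text.lower()):
        if term not in STOPWORDS:
            yield term, 1


def reduce_counts(pairs):
    """Reduce step: sum the emitted counts for each term."""
    totals = Counter()
    for term, count in pairs:
        totals[term] += count
    return totals


def tag_cloud(pages, top=10):
    """Return the `top` most frequent terms across all pages."""
    pairs = (pair for text in pages.values() for pair in map_page(text))
    return reduce_counts(pairs).most_common(top)


if __name__ == "__main__":
    for term, weight in tag_cloud(PAGES, top=2):
        print(term, weight)
```

In a real cluster the map step would run in parallel across many machines and the reduce step would merge their partial counts; here both run in a single process purely to show the shape of the computation.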

Peter Judge

Peter Judge has been involved with tech B2B publishing in the UK for many years, working at Ziff-Davis, ZDNet, IDG and Reed. His main interests are networking security, mobility and cloud.

