Quantcast File System to Open Source
"Doubles" storage efficiency for Hadoop processing.
This is a press release edited by StorageNewsletter.com on October 10, 2012 at 2:27 pm.
Quantcast Corp. announced the release of the Quantcast File System (QFS) as open source.
Evolved from the Kosmos Distributed File System (KFS, also known as CloudStore), QFS offers a higher-performance alternative to the Hadoop Distributed File System (HDFS) for batch data processing, improving data I/O speeds and halving the disk space required to reliably store massive data sets. Integrable with Apache Hadoop, QFS has been live at Quantcast for four years, reliably handling petabyte-scale production workloads.
QFS/HDFS Comparison
The pioneer of direct audience measurement and a provider of real-time audience targeting, Quantcast enables online marketers and publishers to identify key audiences and create relevant advertising and more meaningful interactions with consumers. The company directly measures more than 100 million web destinations, collects well in excess of 500 billion new data records per month and, using QFS as its primary data store, exceeds 20 petabytes of daily processing.
"Quantcast houses one of the world’s largest data sets and understands the challenges and opportunities of big data in a way that few organizations can. The sheer size and scale of its data set easily positions the company among the top five data processing organizations in the world, and its success leveraging QFS in production offers a compelling use case," said Ben Woo, principal analyst and MD at Neuralytix, Inc. "With the inherent ability to more efficiently store, manage and process data at multi-petabyte scale, QFS will be attractive to technology-savvy organizations that require greater storage efficiency and performance from their Hadoop clusters. The project has the potential to be one of the most significant contributions to the open source community since the advent of Hadoop and HDFS."
KFS was designed to provide high-performance backend storage infrastructure for batch processing of large data sets. Released to the open source community in 2007, KFS demonstrated the advantages that a native distributed file system implementation similar to HDFS could deliver, but was fundamentally experimental and insufficiently stable for production usage.
"I am thrilled that the file system the Quantcast team and I built has reached this level of maturity. As we discovered, achieving production reliability at a multi-petabyte scale on thousands of nodes is a long, difficult effort," said Sriram Rao, originator of the KFS project and principal scientist lead at Microsoft Corp. "It’s a rare open source project that provides this much value and has such a track record in a tough production environment. I’m delighted the broader community will be able to benefit and continue to evolve it."
Quantcast adopted KFS in 2008 and QFS emerged from its work to extend the platform’s performance, efficiency and management features to support daily production use. The company has operated with QFS as its primary production file system for over one year, during which time it has handled more than 4 exabytes of IO. QFS’s performance improvements are achieved while simultaneously reducing disk storage requirements by 50% as compared to HDFS.
The QFS 1.0 release has enterprise features that include:
- Fast implementation in C++ with a highly optimized metaserver (the counterpart of the HDFS name node).
- Feedback-directed chunk allocation directs new data chunks to drives based on their available space and dynamic workload.
- Coordinated restarts allow the metaserver to distinguish planned maintenance from node failure and invoke data recovery procedures only for the latter.
- Unix-style permissions control access to files based on user and group identity.
- User logging keeps file-level creation and access records.
- Direct I/O optimizes disk access for highest performance.
- Fixed-footprint memory management keeps memory usage stable, performance consistent, and cluster nodes easy to administer.
- File replication automatically maintains multiple copies of data when desired.
- Concurrent append allows multiple processes to write efficiently to one file, enabling high volume logging and efficient sort algorithms.
- Reed-Solomon error correction ensures data availability with only 50% data expansion (compared to 200% for HDFS).
- Hadoop compatibility plug-in makes QFS compatible with Apache Hadoop, offering a performance alternative for HDFS.
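The 50% and 200% expansion figures in the list above can be checked with a little arithmetic. The sketch below is illustrative only: it assumes a Reed-Solomon layout of 6 data stripes plus 3 parity stripes and an HDFS setup that keeps 3 full copies of each block, neither of which is stated in this release, and the function names are hypothetical.

```python
def expansion_percent(data_units: int, redundant_units: int) -> float:
    """Extra storage needed, as a percentage of the original data size."""
    return 100.0 * redundant_units / data_units


def raw_size(logical_size: float, expansion_pct: float) -> float:
    """Total disk space consumed for a given amount of logical data."""
    return logical_size * (1 + expansion_pct / 100.0)


# Assumption: Reed-Solomon encoding with 6 data stripes and 3 parity stripes.
rs_expansion = expansion_percent(data_units=6, redundant_units=3)

# Assumption: 3-way replication, i.e. 1 original block plus 2 extra copies.
replication_expansion = expansion_percent(data_units=1, redundant_units=2)

# Disk space needed to store 1 PB of logical data under each scheme.
rs_raw_pb = raw_size(1.0, rs_expansion)
replication_raw_pb = raw_size(1.0, replication_expansion)

print(f"Reed-Solomon expansion: {rs_expansion:.0f}%")
print(f"Replication expansion:  {replication_expansion:.0f}%")
print(f"Disk space ratio:       {rs_raw_pb / replication_raw_pb:.2f}")
```

Under these assumptions, 1 PB of data occupies 1.5 PB with Reed-Solomon encoding and 3 PB with triple replication, which is where the "halving" of disk space comes from.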
"In our big data future, file systems such as QFS will underpin cost-effective critical infrastructure for commerce and government. Just as performance and cost efficiency are key attributes of a file system, so are integrity and reliability and we believe that the open source community is the most effective and sustainable path to dependable, enduring file system software," said Konrad Feldman, CEO at Quantcast. "Quantcast makes use of open source software and by making our own contribution with QFS we’re hopeful that others will benefit as we have, and that community collaboration will enable QFS to meet the production demands of big data environments for years to come."
About QFS
QFS is an open-source distributed file system that offers efficiency and cost-effectiveness for large Hadoop and other batch-processing environments. Through Reed-Solomon error correction, finely optimized disk and memory management in C++, load balancing and production management features refined on Quantcast’s petabyte-scale workloads, QFS achieves significant performance improvements versus Hadoop’s standard HDFS, while also reducing the storage footprint by 50%. QFS is plug-in compatible with Apache Hadoop and is available for free download under the Apache 2 license.