Hadoop: The Definitive Guide

Item Description

Hadoop: The Definitive Guide helps you harness the power of your data. Ideal for processing large datasets, the Apache Hadoop framework is an open source implementation of the MapReduce algorithm on which Google built its empire. This comprehensive resource demonstrates how to use Hadoop to build reliable, scalable, distributed systems: programmers will find details for analyzing large datasets, and administrators will learn how to set up and run Hadoop clusters.

Complete with case studies that illustrate how Hadoop solves specific problems, this book helps you:

  • Use the Hadoop Distributed File System (HDFS) for storing large datasets, and run distributed computations over those datasets using MapReduce
  • Become familiar with Hadoop's data and I/O building blocks for compression, data integrity, serialization, and persistence
  • Discover common pitfalls and advanced features for writing real-world MapReduce programs
  • Design, build, and administer a dedicated Hadoop cluster, or run Hadoop in the cloud
  • Use Pig, a high-level query language for large-scale data processing
  • Take advantage of HBase, Hadoop's database for structured and semi-structured data
  • Learn ZooKeeper, a toolkit of coordination primitives for building distributed systems

If you have lots of data -- whether it's gigabytes or petabytes -- Hadoop is the perfect solution. Hadoop: The Definitive Guide is the most thorough book available on the subject.

"Now you have the opportunity to learn about Hadoop from a master -- not only of the technology, but also of common sense and plain talk." -- Doug Cutting, Hadoop Founder, Yahoo!

Product Details

  • Author: Tom White
  • Publication Date: 2009-06-05
  • Publisher: O'Reilly Media
  • Product Group: Book
  • Manufacturer: O'Reilly Media
  • Binding: Paperback, 528 pages
  • Features:
    • ISBN13: 9780596521974
    • Condition: New
    • Notes: BUY WITH CONFIDENCE, Over one million books sold! 98% Positive feedback. Compare our books, prices and service to the competition. 100% Satisfaction Guaranteed
  • Package Dimensions:
    • Dimensions: 910L x 700W x 110H
    • Weight: 155
  • List Price: $44.99
  • ISBN: 0596521979
  • ASIN: 0596521979

Buying Options

Sold by bestever: Usually ships in 1-2 business days

Customer Reviews

Average Amazon User Rating: 4.0 stars

5 stars Brilliant book to get started and keep going 2010-05-19

Reviewer: Simon Reavely

I really enjoyed the book. It has everything you need to:
a) Get started running your own cluster and writing your own MR jobs
b) Understand how to administer the cluster
c) Troubleshoot your programs
d) Learn about really important side projects like Pig, Hive, ZooKeeper and HBase (of which I think Hive is the most amazing)

One thing I wish I'd done is go through the Cloudera online tutorials BEFORE reading this book. If I'd done that (instead of doing so afterwards) I think I'd have got through certain sections of the book much more quickly; basically I would have 'got it' sooner. See [...]

4 stars The elephant is tamed 2010-04-30

Reviewer: JUG Lugano

Original review written by Paolo Canesi, JUG Lugano, www.juglugano.ch

Managing and analyzing huge data sets has become a very common problem in various areas of modern information technology, from different types of Web applications (social, financial, trading, ...) to applications for analyzing scientific data.

Distributed systems running over a cluster of machines are almost a mandatory choice in such cases, but designing and implementing an effective solution in those areas can be troublesome and may become a nightmare.

The Apache Hadoop Project is an infrastructure that helps in the construction of reliable, scalable, distributed systems. Mainly known for its MapReduce and distributed file system (HDFS) subprojects, it actually includes other services that complement or extend them.

Tom White's "Hadoop: The Definitive Guide" is an enjoyable book which fully explains these complex technologies. The book is organized in such a way that the reader is gently guided into the Hadoop ecosystem. It begins with a couple of very readable chapters as a general introduction to the problems Hadoop is meant to solve and the main solutions to them (MapReduce and HDFS), then examines closely all its aspects, often describing what really happens behind the scenes and giving useful design suggestions and descriptions of common pitfalls. When reading this book you won't be overwhelmed by tons of lines of code: examples are short and yet effective.

This kind of structure makes it hard to classify the book as a mere tutorial or as a real reference guide; it is better considered a mix of the two. While this turns out to be a positive choice in many ways, it has some drawbacks: the reader is sometimes forced to go back and forth through the chapters and has to read the book almost entirely to get a full understanding. But this is perhaps the price to pay for a fluent and pleasant read.

Let's go quickly through the chapters:

The first chapter is a brief history of the Hadoop project, illustrating its main characteristics and comparing them to those of other similar technologies. Chapter two is a pleasant introduction to MapReduce. The third chapter breaks the continuity of the previous one by examining the Hadoop Distributed File System (the HDFS subproject) in detail. Chapter four takes a step down the abstraction layers to discuss the Hadoop I/O fundamentals: data integrity, compression, serialization and data structures, explaining the design choices behind them.

Chapters five to eight are an excellent source for learning Hadoop MapReduce in depth. They cover all its aspects, from practical ones, such as how to configure, run, test and debug MapReduce programs, to more advanced and formal ones, like programming models, data formats, and sorting and joining tools.

The two following chapters offer a few very interesting and useful suggestions for setting up and managing a Hadoop cluster, a precious resource for administrators.

Chapters eleven to thirteen cover Pig, HBase and ZooKeeper, subprojects under the Hadoop umbrella. Despite suffering from brevity, they are still interesting.

Chapter fourteen is there so that the reader does not feel alone: it presents important case studies using Hadoop (e.g. Yahoo, and other contributions from the Apache Hadoop community).

My final opinion is that "Hadoop: The Definitive Guide" is a very useful resource for those who want to learn how to ride the "pachydermic" Hadoop (like a "Mahout", perhaps?).

3 stars Partly succeeds 2009-09-08

Reviewer: BillyJoeBob

Tom White certainly writes very well: this book is very readable. It is also quite comprehensive, falling somewhere between a tutorial and a reference.

That being said, I was ultimately rather disappointed. First, and most importantly, it was not clear to me after reading this book how I might use Hadoop for some of my projects, or if indeed they were good candidates for MapReduce. I feel it should have been possible to provide some generic guidance. Second, some chapters are written by other authors, and these did not uniformly provide the same quality of instruction, reading occasionally like advertisements.

I confess I am puzzled by the number of encapsulating and utility APIs that have grown up around Hadoop. Why do we need Pig, HBase, Hive, ZooKeeper and Cascading? Apparently because (according to what I have read here) bare Hadoop is hard to program with productively. Some indication of how these wrappers interact with each other would have been helpful.

As it is, I feel LESS urge to evangelize for Hadoop having read this book. Surely not the desired effect?

5 stars Very comprehensive book 2009-08-31

Reviewer: Philippe Nicolas

I had this book bookmarked for several months and bought it soon after it became available. It's a very comprehensive book, very deep, and it covers many aspects of Hadoop and related technologies. I recommend it without any doubt; enjoy reading and learning.

5 stars First 25 Pages Have You Up And Running! 2009-08-24

Reviewer: Jonathan Zdziarski

I picked up this book to catch up on Hadoop, which the rest of my team has been using for several months. Unfortunately I was too busy with other projects to spend any time on MapReduce and thought it'd be a grueling process to be brought up to speed on it. Within the first 25 pages and about 3 hours, Tom had me up and running my first MapReduce job which I successfully adapted for a specific metric we were trying to generate. The book does a great job of breaking down Hadoop's complex pieces into easy to understand components, but doesn't try and pump you full of conceptual BS before it lets you touch real code.

If I were to make any suggestions, it would be to start the book off with some simple instructions for installing and getting Hadoop up and running on a local machine, followed by some simple explanations of DFS and Hadoop's commands for managing the file system. I would also explain much earlier how to get your classes recognized by Hadoop, for those a bit rusty at Java. Fortunately, the online Wiki was very good about providing instructions to get me going on a Mac, and that took most of the OS-specific burden off the book. You will, no doubt, have to be intelligent to read this book, but if you're using Hadoop, there is already a prerequisite for technical proficiency you'll need to satisfy. Overall good job, Tom.