Hadoop: The Definitive Guide

Language: English

Pages: 756

ISBN: 1491901632

Format: PDF / Kindle (mobi) / ePub


Get ready to unlock the power of your data. With the fourth edition of this comprehensive guide, you’ll learn how to build and maintain reliable, scalable, distributed systems with Apache Hadoop. This book is ideal for programmers looking to analyze datasets of any size, and for administrators who want to set up and run Hadoop clusters.

Using Hadoop 2 exclusively, author Tom White presents new chapters on YARN and several Hadoop-related projects such as Parquet, Flume, Crunch, and Spark. You’ll learn about recent changes to Hadoop, and explore new case studies on Hadoop’s role in healthcare systems and genomics data processing.

  • Learn fundamental components such as MapReduce, HDFS, and YARN
  • Explore MapReduce in depth, including steps for developing applications with it
  • Set up and maintain a Hadoop cluster running HDFS and MapReduce on YARN
  • Learn two data formats: Avro for data serialization and Parquet for nested data
  • Use data ingestion tools such as Flume (for streaming data) and Sqoop (for bulk data transfer)
  • Understand how high-level data processing tools like Pig, Hive, Crunch, and Spark work with Hadoop
  • Learn the HBase distributed database and the ZooKeeper distributed configuration service

Set runs the risk of running out of memory if there are many values for a certain key. This hasn’t happened in practice, but to overcome this, an extra MapReduce step could be introduced to remove all the duplicate values or a secondary sort could be used. (For more details, see “Secondary Sort” on page 227.)

    public void reduce(IntWritable trackId, Iterator values,
        OutputCollector output, Reporter reporter) throws IOException {
      Set userIds
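
For context, here is a minimal sketch of how a reducer of this shape might be completed, assuming the goal is to count the number of distinct user IDs per key. The class name, the HashSet-based deduplication, and the generic signature are illustrative assumptions, not the book's exact code:

    import java.io.IOException;
    import java.util.HashSet;
    import java.util.Iterator;
    import java.util.Set;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    // Hypothetical old-API reducer: emits, for each key, the number of distinct user IDs seen.
    public class DistinctUserCountReducer extends MapReduceBase
        implements Reducer<IntWritable, IntWritable, IntWritable, IntWritable> {

      public void reduce(IntWritable trackId, Iterator<IntWritable> values,
          OutputCollector<IntWritable, IntWritable> output, Reporter reporter)
          throws IOException {
        // The Set holds every distinct value for the key in memory at once,
        // which is exactly the growth risk described above.
        Set<Integer> userIds = new HashSet<Integer>();
        while (values.hasNext()) {
          userIds.add(values.next().get());
        }
        output.collect(trackId, new IntWritable(userIds.size()));
      }
    }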

make this function even more generic, but they are covered in the Cascading User Guide.

Example 14-4. Extending word count and sort with a SubAssembly

    Scheme sourceScheme = new TextLine(new Fields("line"));
    Tap source = new Hfs(sourceScheme, inputPath);
    Scheme sinkScheme = new TextLine(new Fields("word", "count"));
    Tap sink = new Hfs(sinkScheme, outputPath, SinkMode.REPLACE);

    Pipe assembly = new Pipe("wordcount");
    assembly = new ParseWordsAssembly(assembly);
    assembly = new
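
For reference, a SubAssembly in Cascading packages a reusable chain of pipes behind a single class. Below is a hedged sketch of what a ParseWordsAssembly along these lines might look like; the whitespace regex, field names, and imports are assumptions rather than the book's exact listing:

    import cascading.operation.regex.RegexSplitGenerator;
    import cascading.pipe.Each;
    import cascading.pipe.Pipe;
    import cascading.pipe.SubAssembly;
    import cascading.tuple.Fields;

    // Hypothetical SubAssembly that packages the word-parsing step for reuse.
    public class ParseWordsAssembly extends SubAssembly {

      public ParseWordsAssembly(Pipe previous) {
        // Split each incoming "line" on whitespace, emitting one "word" tuple per token.
        previous = new Each(previous, new Fields("line"),
            new RegexSplitGenerator(new Fields("word"), "\\s+"));
        // Register the tail so downstream pipes (GroupBy, Every, and so on) can attach here.
        setTails(previous);
      }
    }

Packaging the parsing step this way lets the same pipes be dropped into multiple flows without duplicating the wiring.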

    class MaxTemperatureMapper : public HadoopPipes::Mapper {
    public:
      MaxTemperatureMapper(HadoopPipes::TaskContext& context) {
      }
      void map(HadoopPipes::MapContext& context) {
        std::string line = context.getInputValue();
        std::string year = line.substr(15, 4);
        std::string airTemperature = line.substr(87, 5);
        std::string q = line.substr(92, 1);
        if (airTemperature != "+9999" &&
            (q == "0" || q == "1" || q == "4" || q == "5" || q == "9")) {
          context.emit(year, airTemperature);
        }
      }
    };

getLength():

    Text t = new Text("hadoop");
    t.set(new Text("pig"));
    assertThat(t.getLength(), is(3));
    assertThat("Byte length not shortened", t.getBytes().length, is(6));

This shows why it is imperative that you always call getLength() when calling getBytes(), so you know how much of the byte array is valid data.

Resorting to String. Text doesn’t have as rich an API for manipulating strings as java.lang.String, so in many cases, you need to convert the Text object to a String. This is done
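
A small, self-contained sketch of both points: trimming getBytes() to getLength(), and converting a Text to a java.lang.String via toString(). The test class name and the JUnit/Hamcrest assertion style are assumptions for illustration:

    import static org.hamcrest.CoreMatchers.is;
    import static org.junit.Assert.assertThat;

    import java.util.Arrays;

    import org.apache.hadoop.io.Text;
    import org.junit.Test;

    // Hypothetical test illustrating the two points from the passage above.
    public class TextConversionTest {

      @Test
      public void trimBytesAndConvertToString() {
        Text t = new Text("hadoop");
        t.set(new Text("pig"));
        // getBytes() returns the backing array, which may be longer than the current
        // value, so only the first getLength() bytes are valid data.
        byte[] valid = Arrays.copyOf(t.getBytes(), t.getLength());
        assertThat(valid.length, is(3));
        // For richer string manipulation, convert the Text to a java.lang.String.
        assertThat(t.toString(), is("pig"));
      }
    }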

objects, so it necessarily has some overhead for serialization and deserialization operations. What’s more, the deserialization procedure creates a new instance for each object deserialized from the stream. Writable objects, on the other hand, can be (and often are) reused. For example, for a MapReduce job, which at its core serializes and deserializes billions of records of just a handful of different types, the savings gained by not having to allocate new objects are significant. In
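
A short sketch of the reuse pattern described here, using IntWritable as the record type; the class name and record count are arbitrary choices for illustration:

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.DataInputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;

    // Hypothetical demo: deserialize a stream of records into a single, reused Writable.
    public class WritableReuseDemo {

      public static void main(String[] args) throws IOException {
        int records = 1000000;

        // Serialize a run of IntWritable records to an in-memory stream.
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        IntWritable writable = new IntWritable();
        for (int i = 0; i < records; i++) {
          writable.set(i);
          writable.write(out);
        }
        out.close();

        // Deserialize every record back into the same instance: no per-record
        // allocation is needed, which is the saving described above.
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()));
        long sum = 0;
        for (int i = 0; i < records; i++) {
          writable.readFields(in);
          sum += writable.get();
        }
        System.out.println("Read " + records + " records, sum = " + sum);
      }
    }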
