Interview Questions

Frequently asked Interview Questions

Hadoop
Spark & Scala
Data Science
Python
DevOps
AWS

1) What is Hadoop Map Reduce?
Hadoop MapReduce is the framework used for processing large data sets in parallel across a Hadoop cluster. Data analysis uses a two-step process: map and reduce.

2) How Hadoop MapReduce works?
During the map phase, the input data is divided into splits that are analyzed by map tasks running in parallel across the Hadoop framework. Taking word count as an example, the map phase counts the words in each document, while the reduce phase aggregates the data across the entire collection of documents.

3) Explain what is shuffling in MapReduce?
The process by which the system performs the sort and transfers the map outputs to the reducer as inputs is known as the shuffle

4) Explain what is distributed Cache in MapReduce Framework?
Distributed Cache is an important feature provided by the MapReduce framework. When you want to share files across all nodes in a Hadoop cluster, Distributed Cache is used. The files could be executable JAR files or simple properties files.

5) Explain what is NameNode in Hadoop?
NameNode in Hadoop is the node where Hadoop stores all the file location information in HDFS (Hadoop Distributed File System). In other words, NameNode is the centerpiece of an HDFS file system. It keeps a record of all the files in the file system and tracks the file data across the cluster or multiple machines.

6) Explain what is JobTracker in Hadoop? What are the actions followed by Hadoop?
In Hadoop, the JobTracker is used for submitting and tracking MapReduce jobs. The JobTracker runs in its own JVM process.
The JobTracker performs the following actions in Hadoop:
Client applications submit jobs to the JobTracker
The JobTracker communicates with the NameNode to determine the location of the data
The JobTracker locates TaskTracker nodes with available slots at or near the data
The JobTracker submits the work to the chosen TaskTracker nodes
When a task fails, the JobTracker is notified and decides how to handle it
The TaskTracker nodes are monitored by the JobTracker

7) Explain what is heartbeat in HDFS?
A heartbeat is a signal exchanged between a DataNode and the NameNode, and between a TaskTracker and the JobTracker. If the NameNode or JobTracker does not receive the signal, it is assumed that there is some issue with the DataNode or TaskTracker.
8) Explain what combiners are and when you should use a combiner in a MapReduce Job?
Combiners are used to increase the efficiency of a MapReduce program. A combiner reduces the amount of data that needs to be transferred across to the reducers. If the operation performed is commutative and associative, you can use your reducer code as a combiner. Note that the execution of a combiner is not guaranteed in Hadoop.
9) What happens when a data node fails?
When a data node fails:

The JobTracker and NameNode detect the failure
All tasks on the failed node are re-scheduled
The NameNode replicates the user’s data to another node
10) Explain what is Speculative Execution?
In Hadoop, during speculative execution, a certain number of duplicate tasks are launched. Using speculative execution, multiple copies of the same map or reduce task can be executed on different slave nodes. In simple terms, if a particular node is taking a long time to complete a task, Hadoop creates a duplicate of that task on another node. The copy that finishes first is retained, and the slower copies are killed.
11) Explain what are the basic parameters of a Mapper?
The basic parameters of a Mapper are (illustrated in the sketch below):
LongWritable and Text (the input key and value)
Text and IntWritable (the output key and value)
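For illustration, here is a minimal word-count style mapper written in Scala against the Hadoop Java API; its signature uses exactly these parameter types. The class name, tokenization logic and word-count use case are assumptions for the example, not part of the original answer.

```scala
import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
import org.apache.hadoop.mapreduce.Mapper

// Hypothetical mapper: input key/value are LongWritable (byte offset) and Text (the line),
// output key/value are Text (the word) and IntWritable (the count of 1).
class WordCountMapper extends Mapper[LongWritable, Text, Text, IntWritable] {
  private val one  = new IntWritable(1)
  private val word = new Text()

  override def map(key: LongWritable, value: Text,
                   context: Mapper[LongWritable, Text, Text, IntWritable]#Context): Unit = {
    value.toString.split("\\s+").filter(_.nonEmpty).foreach { w =>
      word.set(w)
      context.write(word, one) // emits (word, 1) pairs
    }
  }
}
```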
12) Explain what is the function of MapReduce partitioner?
The function of the MapReduce partitioner is to make sure that all the values of a single key go to the same reducer, which eventually helps ensure an even distribution of the map output over the reducers.
13) Explain what is a difference between an Input Split and HDFS Block?
The logical division of data is known as an Input Split, while the physical division of data is known as an HDFS Block.
14) Explain what happens in text input format?
In text input format, each line in the text file is a record. The value is the content of the line, while the key is the byte offset of the line. For instance, key: LongWritable, value: Text.
15) Mention what are the main configuration parameters that user need to specify to run MapReduce Job?
The user of the MapReduce framework needs to specify
Job’s input locations in the distributed file system
Job’s output location in the distributed file system
Input format
Output format
Class containing the map function
Class containing the reduce function
JAR file containing the mapper, reducer and driver classes
16) Explain what is WebDAV in Hadoop?
WebDAV is a set of extensions to HTTP that supports editing and updating files. On most operating systems, WebDAV shares can be mounted as filesystems, so it is possible to access HDFS as a standard filesystem by exposing HDFS over WebDAV.
17)  Explain what is Sqoop in Hadoop?
Sqoop is a tool used to transfer data between relational database management systems (RDBMS) and Hadoop HDFS. Using Sqoop, data can be imported from an RDBMS such as MySQL or Oracle into HDFS, as well as exported from HDFS back to an RDBMS.
18) Explain how JobTracker schedules a task?
The TaskTracker sends heartbeat messages to the JobTracker, usually every few minutes, to confirm that it is alive and functioning. The message also informs the JobTracker about the number of available slots, so the JobTracker can stay up to date on where in the cluster work can be delegated.
19) Explain what is Sequencefileinputformat?

SequenceFileInputFormat is used for reading files in sequence. It is a specific compressed binary file format that is optimized for passing data from the output of one MapReduce job to the input of another MapReduce job.
20) Explain what does the conf.setMapper Class do?
Conf.setMapperClass sets the mapper class and everything related to the map job, such as reading the data and generating a key-value pair out of the mapper.
21) Explain what is Hadoop?
It is an open-source software framework for storing data and running applications on clusters of commodity hardware.  It provides enormous processing power and massive storage for any type of data.
22) Mention what is the difference between an RDBMS and Hadoop?
RDBMS vs Hadoop:
RDBMS is a relational database management system, whereas Hadoop is a node-based flat structure.
RDBMS is used for OLTP processing, whereas Hadoop is currently used for analytical and big data processing.
In an RDBMS, the database cluster uses the same data files stored in shared storage, whereas in Hadoop the data can be stored independently on each processing node.
With an RDBMS, you need to preprocess data before storing it, whereas with Hadoop you don’t.
23) Mention Hadoop core components?
Hadoop core components include,
HDFS
MapReduce
24) What is NameNode in Hadoop?
NameNode in Hadoop is where Hadoop stores all the file location information for HDFS. It is the master node on which the JobTracker runs, and it holds the metadata.
25) Mention what are the data components used by Hadoop?
Data components used by Hadoop are
Pig
Hive
26) Mention what is the data storage component used by Hadoop?
The data storage component used by Hadoop is HBase.
27) Mention what are the most common input formats defined in Hadoop?
The most common input formats defined in Hadoop are:
TextInputFormat
KeyValueInputFormat
SequenceFileInputFormat
28) In Hadoop what is InputSplit?
It splits input files into chunks and assigns each split to a mapper for processing.
29) For a Hadoop job, how will you write a custom partitioner?

To write a custom partitioner for a Hadoop job, follow these steps (a sketch follows the list):
Create a new class that extends the Partitioner class
Override the getPartition method
In the wrapper that runs the MapReduce job, add the custom partitioner to the job by using the setPartitionerClass method, or add the custom partitioner to the job as a config file
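A minimal sketch of such a partitioner in Scala, assuming a job with Text keys and IntWritable values and a made-up routing rule (partition by the first letter of the key); the class name and driver snippet are illustrative only.

```scala
import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.mapreduce.Partitioner

// Hypothetical partitioner: routes keys to reducers based on the first letter of the key.
class FirstLetterPartitioner extends Partitioner[Text, IntWritable] {
  override def getPartition(key: Text, value: IntWritable, numPartitions: Int): Int = {
    val first = key.toString.headOption.getOrElse(' ')
    // Mask the sign bit so the result is always a valid partition index.
    (first.hashCode & Integer.MAX_VALUE) % numPartitions
  }
}

// In the driver (the "wrapper" that runs the MapReduce job), something like:
//   val job = Job.getInstance(conf, "custom partitioner example")
//   job.setPartitionerClass(classOf[FirstLetterPartitioner])
//   job.setNumReduceTasks(4)
```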
30) For a job in Hadoop, is it possible to change the number of mappers to be created?
No, it is not possible to change the number of mappers to be created. The number of mappers is determined by the number of input splits.
31) Explain what is a sequence file in Hadoop?
A sequence file is used to store binary key/value pairs. Unlike a regular compressed file, a sequence file supports splitting even when the data inside the file is compressed.
32) When Namenode is down what happens to job tracker?
The NameNode is the single point of failure in HDFS, so when the NameNode is down, your cluster will be unavailable.
33) Explain how indexing in HDFS is done?
Hadoop has a unique way of indexing. Once the data is stored as per the block size, HDFS keeps storing the last part of the data, which indicates where the next part of the data will be.
34) Explain is it possible to search for files using wildcards?
Yes, it is possible to search for files using wildcards.
35) List out Hadoop’s three configuration files?
The three configuration files are
core-site.xml
mapred-site.xml
hdfs-site.xml
36) Explain how can you check whether Namenode is working beside using the jps command?
Besides the jps command, you can also check whether the NameNode is working by using
/etc/init.d/hadoop-0.20-namenode status.
37) Explain what is “map” and what is “reducer” in Hadoop?
In Hadoop, a map is a phase in HDFS query solving. A map reads data from an input location and outputs a key-value pair according to the input type.
In Hadoop, a reducer collects the output generated by the mapper, processes it, and creates a final output of its own.
38) In Hadoop, which file controls reporting in Hadoop?
In Hadoop, the hadoop-metrics.properties file controls reporting.
39) For using Hadoop list the network requirements?
The network requirements for using Hadoop are:
Password-less SSH connection
Secure Shell (SSH) for launching server processes
40) Mention what is rack awareness?
Rack awareness is the way in which the NameNode determines how to place blocks based on the rack definitions.
41) Explain what is a Task Tracker in Hadoop?
A Task Tracker in Hadoop is a slave node daemon in the cluster that accepts tasks from a JobTracker. It also sends out the heartbeat messages to the JobTracker, every few minutes, to confirm that the JobTracker is still alive.
42) Mention what daemons run on a master node and slave nodes?
The daemon that runs on the master node is the “NameNode”
The daemons that run on each slave node are the “TaskTracker” and the “DataNode”
43) Explain how can you debug Hadoop code?
The popular methods for debugging Hadoop code are:
By using the web interface provided by the Hadoop framework
By using Counters
44) Explain what is storage and compute nodes?
The storage node is the machine or computer where your file system resides to store the processing data
The compute node is the computer or machine where your actual business logic will be executed.
45) Mention what is the use of Context Object?
The Context Object enables the mapper to interact with the rest of the Hadoop system. It includes configuration data for the job, as well as interfaces which allow it to emit output.
46) Mention what is the next step after Mapper or MapTask?
The next step after the Mapper or MapTask is that the output of the Mapper is sorted, and partitions are created for the output.
47) Mention what is the default partitioner in Hadoop?
In Hadoop, the default partitioner is a “Hash” Partitioner.
48) Explain what is the purpose of RecordReader in Hadoop?
In Hadoop, the RecordReader loads the data from its source and converts it into (key, value) pairs suitable for reading by the Mapper.
49) Explain how is data partitioned before it is sent to the reducer if no custom partitioner is defined in Hadoop?
If no custom partitioner is defined in Hadoop, then a default partitioner computes a hash value for the key and assigns the partition based on the result.
50) Explain what happens when Hadoop spawned 50 tasks for a job and one of the task failed?
Hadoop will restart the task on another TaskTracker; the whole job fails only if the same task fails more than the defined limit of attempts.
51) Mention what is the best way to copy files between HDFS clusters?
The best way to copy files between HDFS clusters is by using multiple nodes and the distcp command, so the workload is shared.
52) Mention what is the difference between HDFS and NAS?
HDFS data blocks are distributed across local drives of all machines in a cluster while NAS data is stored on dedicated hardware.
53) Mention how Hadoop is different from other data processing tools?
In Hadoop, you can increase or decrease the number of mappers without worrying about the volume of data to be processed.
54) Mention what job does the conf class do?
The JobConf class separates different jobs running on the same cluster. It handles job-level settings, such as declaring a job in a real environment.
55) Mention what is the Hadoop MapReduce APIs contract for a key and value class?
For a key and value class, there are two Hadoop MapReduce API contracts:
The value class must implement the org.apache.hadoop.io.Writable interface
The key class must implement the org.apache.hadoop.io.WritableComparable interface
56) Mention what are the three modes in which Hadoop can be run?
The three modes in which Hadoop can be run are
Pseudo distributed mode
Standalone (local) mode
Fully distributed mode
57) Mention what does the text input format do?
The text input format creates a line object for each line of input. The value is the whole line of text, while the key is the byte offset of the line. The mapper receives the value as a ‘Text’ parameter and the key as a ‘LongWritable’ parameter.
58) Mention how many InputSplits is made by a Hadoop Framework?
For example, given three files of 64 KB, 65 MB and 127 MB (with a 64 MB block size), Hadoop will make 5 splits:
1 split for the 64 KB file
2 splits for the 65 MB file
2 splits for the 127 MB file
59) Mention what is distributed cache in Hadoop?
Distributed cache in Hadoop is a facility provided by the MapReduce framework. It is used to cache files at the time of execution of the job. The framework copies the necessary files to the slave node before the execution of any task at that node.
60) Explain how does Hadoop Classpath plays a vital role in stopping or starting in Hadoop daemons?
The classpath consists of a list of directories containing the JAR files needed to stop or start the daemons.


1. What is Apache Spark?
 
Apache Spark is a lightning-fast unified analytics engine for large-scale distributed data processing and machine learning.
 
2. What are the features/benefits of Apache Spark?
 
Speed: Engineered from the bottom up for performance, Spark can be 100x faster than Hadoop for large-scale data processing by exploiting in-memory computing and other optimizations. Spark is also fast when data is stored on disk, and currently holds the world record for large-scale on-disk sorting.
Ease of Use: Spark has easy-to-use APIs for operating on large datasets. This includes a collection of over 100 operators for transforming data and familiar data frame APIs for manipulating semi-structured data.
A Unified Engine: Spark comes packaged with higher-level libraries, including support for SQL queries, streaming data, machine learning and graph processing. These standard libraries increase developer productivity and can be seamlessly combined to create complex workflows.


3. What advantages does Spark offer over Hadoop MapReduce?

·      Due to the availability of in-memory processing, Spark performs processing around 10 to 100 times faster than Hadoop MapReduce, which uses persistent storage for its data processing tasks.
·      Unlike Hadoop, Spark provides built-in libraries to perform multiple tasks from the same core, such as batch processing, streaming, machine learning, and interactive SQL queries. Hadoop, on the other hand, only supports batch processing.
·      Hadoop is highly disk-dependent, whereas Spark promotes caching and in-memory data storage.
·      Spark can perform computations multiple times on the same dataset. This is called iterative computation, while there is no iterative computing implemented by Hadoop.
 
4. Enumerate the various components of the Spark Ecosystem?

Spark Core – Base engine for large-scale parallel and distributed data processing

Spark SQL – Integrates relational processing with Spark’s functional programming API

Spark Streaming – Used for processing real-time streaming data

MLlib – Used for applying machine learning

GraphX – Implements Graphs and graph-parallel computation

5. What are the various functions of Spark Core?
 
Spark Core acts as the base engine for large-scale parallel and distributed data processing. It is the distributed execution engine used in conjunction with the Java, Python, and Scala APIs that offer a platform for distributed ETL (Extract, Transform, Load) application development.
Various functions of Spark Core are:
·      Distributing, monitoring, and scheduling jobs on a cluster
·      Interacting with storage systems
·      Memory management and fault recovery
Furthermore, additional libraries built on top of Spark Core allow it to handle diverse workloads for machine learning, streaming, and SQL query processing.
 
6. Mention various modes of running Spark

Local mode
Clustered mode:
·      Spark Standalone
·      Spark on Hadoop YARN
·      Spark on Apache Mesos
 
7. What is RDD?
 
RDD is the acronym for Resilient Distributed Dataset. It is Spark’s primary core abstraction.
An RDD is a fault-tolerant collection of operational elements that run in parallel. The partitioned data in an RDD is immutable and distributed in nature.
Fundamentally, RDDs are portions of data that are stored in memory distributed across many nodes. RDDs are lazily evaluated in Spark, which is the main factor contributing to the faster speeds achieved by Apache Spark.

8.  How can we create RDD in Apache Spark?
There are two ways of creating an RDD in Apache Spark (both are sketched below):
·      By parallelizing a collection in the driver program, using SparkContext’s parallelize() method.
·      By loading an external dataset from external storage, such as HDFS, S3, HBase, or a shared file system.
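A minimal sketch of both approaches; the application name, local master and HDFS path are placeholders for this example.

```scala
import org.apache.spark.sql.SparkSession

object RddCreationExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("rdd-creation").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // 1) Parallelizing an in-driver collection
    val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5))

    // 2) Loading an external dataset (the path is a placeholder)
    val lines = sc.textFile("hdfs:///data/input.txt")

    println(numbers.count())
    spark.stop()
  }
}
```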

9. Define Partitions in Apache Spark.
 
As the name suggests, a partition is a smaller, logical division of data, similar to a ‘split’ in MapReduce. It is a logical chunk of a large distributed data set.
Partitioning is the process of deriving logical units of data to speed up data processing. Spark manages data using partitions, which help parallelize distributed data processing with minimal network traffic for sending data between executors. By default, Spark tries to read data into an RDD from the nodes that are close to it. Everything in Spark is a partitioned RDD.

10. What operations does RDD support?

RDDs support two types of operations:
Transformations:
Transformations are operations on an RDD that create one or more new RDDs, for example map, filter and reduceByKey. In other words, transformations are functions that take an RDD as input and produce one or more RDDs as output. The input RDD is not changed; the computation they represent always produces one or more new RDDs.

Transformations are lazy, i.e. they are not executed immediately. Only after calling an action are transformations executed.
Actions:
Actions return the final results of the RDD computations. An action triggers execution using the lineage graph to load the data into the original RDD, carry out all intermediate transformations, and return the final results to the driver program or write them out to the file system. Examples include collect, count and take. Both kinds of operation are shown in the sketch below.
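The following sketch illustrates both kinds of operation; the sample data and local master are assumptions for the example.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object TransformationActionExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ops").setMaster("local[*]"))

    val words = sc.parallelize(Seq("spark", "hadoop", "spark", "hive"))

    // Transformations: lazily build new RDDs, nothing executes yet
    val pairs    = words.map(w => (w, 1))
    val counts   = pairs.reduceByKey(_ + _)
    val filtered = counts.filter { case (_, n) => n > 1 }

    // Action: triggers execution of the whole lineage and returns results to the driver
    filtered.collect().foreach(println)
    println(words.count()) // another action

    sc.stop()
  }
}
```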
 
11. What are the types of Apache Spark transformations?

There are fundamentally two types of transformations:
Narrow Transformations: Narrow transformations are the result of operations such as map and filter, where the data required comes from a single partition only, i.e. it is self-sustained. An output RDD has partitions with records that originate from a single partition in the parent RDD. Only a limited subset of partitions is used to calculate the result.
Wide Transformations: Wide transformations are the result of operations such as groupByKey and reduceByKey. The data required to compute the records in a single partition may reside in many partitions of the parent RDD. A wide transformation results in a new stage with a new set of partitions.

12. What is RDD Lineage?
 
An RDD Lineage Graph (aka RDD operator graph) is a graph of all the parent RDDs of an RDD. It is built as a result of applying transformations to the RDD and creates a logical execution plan.
An RDD lineage graph is hence a graph of what transformations need to be executed after an action has been called.
The lineage graph is used to recover RDDs from a failure.

13. What is Spark Context in Apache Spark?
 
A SparkContext is a client of Spark’s execution environment and it acts as the master of the Spark application. SparkContext sets up internal services and establishes a connection to a Spark execution environment. After the SparkContext is created, you can create RDDs, accumulators and broadcast variables, access Spark services and run jobs (until the SparkContext stops). Only one SparkContext may be active per JVM. You must stop() the active SparkContext before creating a new one.
 
In Spark shell, a special interpreter-aware SparkContext is already created for the user, in the variable called sc.
The first step of any Spark driver application is to create a SparkContext. The SparkContext allows the Spark driver application to access the cluster through a resource manager. The resource manager can be YARN, or Spark’s Cluster Manager.
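A minimal sketch of creating and stopping a SparkContext; a local master is assumed here, while in a cluster it would point at YARN, Mesos or a standalone master.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SparkContextExample {
  def main(args: Array[String]): Unit = {
    // The configuration names the application and chooses a master (local for this sketch)
    val conf = new SparkConf().setAppName("context-demo").setMaster("local[*]")
    val sc = new SparkContext(conf) // only one active SparkContext per JVM

    val rdd = sc.parallelize(1 to 10)
    println(rdd.sum())

    sc.stop() // stop it before creating another SparkContext
  }
}
```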

14.  By Default, how many partitions are created in RDD in Apache Spark?
 
By Default, Spark creates one Partition for each block of the file (for HDFS). Default block size for HDFS block is 64 MB (Hadoop Version 1) / 128 MB (Hadoop Version 2). However, one can explicitly specify the number of partitions to be created.

15. What is the difference between RDD Lineage and DAG?
 
An RDD Lineage is a graph of all the parent RDDs of an RDD. We also call it an RDD operator graph or RDD dependency graph. It is built as a result of applying transformations to the RDD and creates a logical execution plan.
An RDD lineage graph is hence a graph of what transformations need to be executed after an action has been called.
RDD lineage is accessed by calling the toDebugString() method on an RDD.
 
Whereas, DAG represents the physical execution plan, known as DAG of stages.
DAG in Apache Spark is a combination of vertices and edges: the vertices represent the RDDs and the edges represent the operations to be applied to the RDDs. Every edge in the DAG is directed from earlier to later in the sequence. When we call an action, the created DAG is submitted to the DAG Scheduler, which further splits the graph into stages of tasks.
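A small sketch that prints an RDD’s lineage with toDebugString() and then triggers the DAG with an action; the sample data and local master are assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LineageExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("lineage").setMaster("local[*]"))

    val base     = sc.parallelize(1 to 100)
    val doubled  = base.map(_ * 2)
    val filtered = doubled.filter(_ % 3 == 0)

    // Prints the RDD lineage (operator/dependency graph) as text
    println(filtered.toDebugString)

    // Calling an action makes the DAG Scheduler turn this lineage into stages of tasks
    println(filtered.count())

    sc.stop()
  }
}
```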
 
16. What is the role of Spark Driver in Spark application?
 
A Spark driver (aka an application’s driver process) is a JVM process that hosts SparkContext for a Spark application.
·      It is the cockpit of job and task execution (using the DAGScheduler and the Task Scheduler). It hosts the Web UI for the environment.
·      It splits a Spark application into tasks and schedules them to run on executors.
·      The driver is where the task scheduler lives and spawns tasks across workers.
·      The driver coordinates workers and the overall execution of tasks.

17. Difference between map() and flatMap() in Spark?
 
map() and flatMap() are similar, in the sense they take a line from the input RDD and apply a function on it. The way they differ is that the function in map returns only one element, while function in flatMap can return a list of elements (0 or more) as an iterator.
Also, the output of the flatMap() is flattened. Although the function in flatMap returns a list of elements, the flatMap returns an RDD which has all the elements from the list in a flat way (not a list).
map(func) – It returns a new RDD by applying the given function to each element of the RDD. The function in map returns only one item.
flatMap(func) – Similar to map, it returns a new RDD by applying a function to each element of the RDD, but the output is flattened.
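A short sketch contrasting the two; the sample lines and local master are assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object MapVsFlatMapExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("map-vs-flatmap").setMaster("local[*]"))

    val lines = sc.parallelize(Seq("to be or", "not to be"))

    // map: one output element per input element (here, an Array of words per line)
    val arrays = lines.map(_.split(" "))
    println(arrays.count()) // 2 -> still one record per line

    // flatMap: zero or more output elements per input element, flattened
    val words = lines.flatMap(_.split(" "))
    println(words.count())  // 6 -> one record per word

    sc.stop()
  }
}
```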

18. Difference between groupByKey() and reduceByKey() in Spark?

groupByKey – When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs.
When groupByKey() is called on a pair RDD, the data in the partitions is shuffled over the network to form a key and a list of values. This is a costly operation, particularly when working on a large data set. It can also cause trouble when the combined value list is too large to fit in one partition, in which case a disk spill will occur.
 
reduceByKey(function) – When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function. The function must take two arguments of the value type and return a result of the same type.
Unlike groupByKey, reduceByKey does not shuffle all the data at the beginning. Because it knows that the reduce operation can be applied within each partition first, only the result of the reduce function is shuffled over the network. This causes a significant reduction in network traffic.
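A short sketch of both operations producing the same result, where reduceByKey pre-aggregates within partitions before shuffling; the sample pairs and local master are assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object GroupVsReduceExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("group-vs-reduce").setMaster("local[*]"))

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1), ("a", 1), ("b", 1)))

    // groupByKey: shuffles every (key, value) pair, then sums the collected values per key
    val grouped = pairs.groupByKey().mapValues(_.sum)

    // reduceByKey: combines values within each partition first, shuffles only partial sums
    val reduced = pairs.reduceByKey(_ + _)

    grouped.collect().foreach(println) // e.g. (a,3), (b,2)
    reduced.collect().foreach(println) // same result, less network traffic

    sc.stop()
  }
}
```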

19. Difference between Cache and Persist in Spark?
 
Cache and Persist both are optimization techniques for Spark computations.
Cache is a synonym of Persist with the MEMORY_ONLY storage level, i.e. using caching we save intermediate results in memory only.
Persist marks an RDD for persistence using a storage level, which can be MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_ONLY_SER, MEMORY_AND_DISK_SER, DISK_ONLY, MEMORY_ONLY_2 or MEMORY_AND_DISK_2.
Just because you can cache an RDD in memory doesn’t mean you should blindly do so. Depending on how many times the dataset is accessed and the amount of work involved in re-computing it, re-computation can be faster than the price paid by the increased memory pressure.
It should go without saying that if you only read a dataset once, there is no point in caching it; it will only make your job slower.
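A minimal sketch of cache() versus persist() with an explicit storage level; the dataset sizes and local master are assumptions.

```scala
import org.apache.spark.storage.StorageLevel
import org.apache.spark.{SparkConf, SparkContext}

object CachePersistExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("cache-persist").setMaster("local[*]"))

    val expensive = sc.parallelize(1 to 1000000).map(x => x.toLong * x)

    // cache() is shorthand for persist(StorageLevel.MEMORY_ONLY)
    expensive.cache()

    // persist() lets you pick a storage level explicitly
    val onDiskToo = sc.parallelize(1 to 1000000).persist(StorageLevel.MEMORY_AND_DISK)

    // The dataset is materialised on the first action and reused on the second
    println(expensive.count())
    println(expensive.sum())
    println(onDiskToo.count())

    sc.stop()
  }
}
```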
 
20. What are shared variables in Spark?
 
Shared variables are variables that need to be used by many functions and methods in parallel. Shared variables can be used in parallel operations such as transformations.
Spark segregates the job into the smallest possible operations, tasks, running on different nodes, each having a copy of all the variables of the Spark job. Any changes made to these variables do not reflect back in the driver program; to overcome this limitation, Spark provides two special types of shared variables – Broadcast Variables and Accumulators.
Broadcast variables:
Broadcast variables are used to cache a value in memory on all nodes. Only one instance of this read-only variable is shared between all computations throughout the cluster.
Spark sends the broadcast variable to each node concerned by the related task. After that, each node caches it locally in serialised form. Before executing each of the planned tasks, instead of getting values from the driver, the system retrieves them locally from the cache.
Broadcast variables are immutable, distributed i.e., broadcasted to the cluster and fit in memory.
Accumulators:
As its name suggests, an accumulator’s main role is to accumulate values. Accumulators are variables that are used to implement counters and sums. Spark provides accumulators of numeric type only. Users can create named or unnamed accumulators.
Unlike broadcast variables, accumulators are writable. However, the written/updated values can only be read in the driver program. This is why accumulators work well as data aggregators.
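A small sketch using a broadcast lookup table and a long accumulator; the lookup data and counter semantics are made up for the example.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SharedVariablesExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("shared-vars").setMaster("local[*]"))

    // Broadcast variable: a read-only lookup table cached once per node
    val countryNames = sc.broadcast(Map("IN" -> "India", "US" -> "United States"))

    // Accumulator: tasks only add to it; the total is read back in the driver
    val unknownCodes = sc.longAccumulator("unknown country codes")

    val codes = sc.parallelize(Seq("IN", "US", "XX", "IN"))
    val resolved = codes.map { code =>
      countryNames.value.getOrElse(code, { unknownCodes.add(1); "unknown" })
    }

    resolved.collect().foreach(println)
    println(s"unknown codes seen: ${unknownCodes.value}") // read in the driver

    sc.stop()
  }
}
```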

21. Difference between map() and mapPartitions() in Spark?

map() – It returns a new RDD by applying the given function to each element of the RDD. The function in the map returns only one item.
mapPartitions() – This is a specialized map that is called only once for each partition. That is, the function runs separately on each partition (block) of the RDD, so the function must be of type Iterator<T> ⇒ Iterator<U> when running on an RDD of type T.
map() exercises the function at the element/record level, whereas mapPartitions() exercises the function at the partition level.
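A short sketch contrasting the two; the per-partition sum is a made-up stand-in for per-partition work such as opening a connection once.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object MapPartitionsExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("map-partitions").setMaster("local[*]"))

    val data = sc.parallelize(1 to 10, numSlices = 2)

    // map: the function is applied once per element
    val squared = data.map(x => x * x)

    // mapPartitions: the function is applied once per partition and works on an iterator
    val summedPerPartition = data.mapPartitions { iter =>
      Iterator(iter.sum) // Iterator[Int] in, Iterator[Int] out
    }

    println(squared.collect().mkString(","))
    println(summedPerPartition.collect().mkString(",")) // one sum per partition

    sc.stop()
  }
}
```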

22. Difference between coalesce() and repartition() in Spark?

coalesce uses existing partitions to minimize the amount of data that is shuffled. That is, it works with existing partitions and shuffles only a subset of them.
repartition creates new partitions and does a full shuffle. That is, it ignores existing partitions and creates new ones.
coalesce results in partitions with different amounts of data (sometimes partitions of very different sizes), whereas repartition results in roughly equal-sized partitions.

Put simply, coalesce is only used to decrease the number of partitions; there is no full shuffling of data, it just adjusts within the existing partitions.
repartition can be used to both decrease and increase the number of partitions, as full shuffling takes place.
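A minimal sketch showing the partition counts before and after each call; the partition numbers are arbitrary for the example.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CoalesceRepartitionExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("coalesce-repartition").setMaster("local[*]"))

    val rdd = sc.parallelize(1 to 100, numSlices = 8)
    println(rdd.getNumPartitions)   // 8

    // coalesce: decrease partitions without a full shuffle (merges existing ones)
    val fewer = rdd.coalesce(2)
    println(fewer.getNumPartitions) // 2

    // repartition: full shuffle, can increase or decrease the partition count
    val more = rdd.repartition(16)
    println(more.getNumPartitions)  // 16

    sc.stop()
  }
}
```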
 
23. What does “Stage Skipped” mean in Apache Spark web UI?
 
Stage Skipped means that the data has been fetched from cache, so re-execution of the given stage is not required. Basically, the stage has been evaluated before, and its result is available without re-execution. Whenever shuffling is involved, Spark automatically caches the generated data.
 
24.  Will rdd1.join(rdd2) cause a shuffle to happen if rdd1 and rdd2 have the same partitioner?
 
No. If two RDDs have the same partitioner, there will be no shuffle caused by the join.
However, keep in mind that the lack of a shuffle does not mean that no data will have to be moved between nodes. It’s possible for two RDDs to have the same partitioner (be co-partitioned) yet have the corresponding partitions located on different nodes (not be co-located).
This situation is still better than doing a shuffle, but it’s something to keep in mind. Co-location can improve performance but is hard to guarantee.
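A small sketch of co-partitioning two pair RDDs with the same HashPartitioner before joining them; the sample data is made up.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object CopartitionedJoinExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("copartition-join").setMaster("local[*]"))

    val partitioner = new HashPartitioner(4)

    // Both RDDs are hash-partitioned the same way (and cached), so the join
    // can reuse the existing partitioning instead of shuffling again.
    val left  = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "c"))).partitionBy(partitioner).cache()
    val right = sc.parallelize(Seq((1, 10), (2, 20))).partitionBy(partitioner).cache()

    val joined = left.join(right)
    joined.collect().foreach(println)

    sc.stop()
  }
}
```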

25. In a given spark program, how will you identify whether a given operation is Transformation or Action?
 
One can identify the operation based on its return type:
i) The operation is an action if the return type is something other than an RDD.
ii) The operation is a transformation if the return type is an RDD.

26. What is Lazy evaluation in Spark?
 
Spark is intelligent in the way it operates on data. When you tell Spark to operate on a given dataset, it heeds the instructions and makes a note of them, so that it does not forget – but it does nothing until asked for the final result. When a transformation like map() is called on an RDD, the operation is not performed immediately. Transformations in Spark are not evaluated until you perform an action. This helps optimize the overall data processing workflow.
 
27. What do you understand by Pair RDD?
 
Special operations can be performed on RDDs in Spark using key/value pairs; such RDDs are referred to as Pair RDDs.
Pair RDDs allow users to access each key in parallel. Pair RDDs have a reduceByKey() method that aggregates data based on each key and a join() method that combines different RDDs, based on the elements having the same key.
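A short Pair RDD sketch using reduceByKey() and join(); the sales/prices data is made up for the example.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PairRddExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("pair-rdd").setMaster("local[*]"))

    // Pair RDDs: RDDs of (key, value) tuples
    val sales  = sc.parallelize(Seq(("apples", 10), ("oranges", 5), ("apples", 7)))
    val prices = sc.parallelize(Seq(("apples", 1.5), ("oranges", 2.0)))

    val totals = sales.reduceByKey(_ + _) // aggregate values per key
    val joined = totals.join(prices)      // combine two pair RDDs on the key

    joined.collect().foreach(println)     // e.g. (apples,(17,1.5)), (oranges,(5,2.0))

    sc.stop()
  }
}
```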
28. How Can You Minimize Data Transfers When Working with Spark?
 
Minimizing data transfers and avoiding shuffling helps write Spark programs that run in a fast and reliable manner. The various ways in which data transfers can be minimized when working with Apache Spark are:
·      Using Broadcast Variables – broadcast variables enhance the efficiency of joins between small and large RDDs.
·      Using Accumulators – accumulators help update the values of variables in parallel while executing.
·      The most common way is to avoid ByKey operations, repartition, or any other operations that trigger shuffles.



29. What is Spark SQL? List the functions of Spark SQL.
 
Spark SQL is a Spark interface to work with structured as well as semi-structured data. Through this module, Spark executes relational SQL queries on the data.
The core of the component supports a different kind of RDD called SchemaRDD, composed of row objects and schema objects defining the data type of each column in the row. It is similar to a table in a relational database.
 
Spark SQL is capable of:
·      Loading data from a variety of structured sources
·      Querying data using SQL statements, both inside a Spark program and from external tools that connect to Spark SQL through standard database connectors (JDBC/ODBC). For instance, using business intelligence tools like Tableau
·      Providing rich integration between SQL and regular Python/Java/Scala code, including the ability to join RDDs and SQL tables, expose custom functions in SQL, and more
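A minimal Spark SQL sketch that registers a DataFrame as a temporary view and queries it with SQL; the people data is made up, and in practice the data could come from JSON, Parquet, Hive or JDBC sources.

```scala
import org.apache.spark.sql.SparkSession

object SparkSqlExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("spark-sql").master("local[*]").getOrCreate()
    import spark.implicits._

    // A small structured dataset created in the driver for illustration
    val people = Seq(("Alice", 34), ("Bob", 23), ("Carol", 41)).toDF("name", "age")

    // Register a temporary view and query it with a SQL statement
    people.createOrReplaceTempView("people")
    val adults = spark.sql("SELECT name, age FROM people WHERE age > 30")

    adults.show()
    spark.stop()
  }
}
```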

30. What is the advantage of a Parquet file?
 
A Parquet file is a columnar format file that helps to:
·      Limit I/O operations
·      Consume less space
·      Fetch only the required columns.
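A small sketch writing and reading Parquet, where selecting a single column lets Parquet’s columnar layout limit the I/O; the /tmp path and sample data are placeholders.

```scala
import org.apache.spark.sql.SparkSession

object ParquetExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("parquet-demo").master("local[*]").getOrCreate()
    import spark.implicits._

    val df = Seq(("Alice", 34), ("Bob", 23)).toDF("name", "age")

    // Write as Parquet: columnar, compressed, schema stored with the data
    df.write.mode("overwrite").parquet("/tmp/people.parquet")

    // Column pruning: only the 'name' column is read from disk, limiting I/O
    val namesOnly = spark.read.parquet("/tmp/people.parquet").select("name")
    namesOnly.show()

    spark.stop()
  }
}
```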

31. Difference between Spark SQL and Hive

·      Spark SQL is faster than Hive.
·      Any Hive query can easily be executed in Spark SQL but vice-versa is not true.
·      Spark SQL is a library whereas Hive is a framework.
·      It is not mandatory to create a metastore in Spark SQL, but it is mandatory to create a Hive metastore.
·      Spark SQL automatically infers the schema whereas in Hive schema needs to be explicitly defined.

32. What is Catalyst optimizer in Apache Spark?
 
The optimizer used by Spark SQL is the Catalyst optimizer. It optimizes the logical plan of SQL queries written in Spark SQL and the DataFrame DSL. It is a set of rules applied to the SQL query to rewrite it in a better way and gain performance.
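A quick way to see Catalyst at work is to print a query’s plans with explain(true), which shows the parsed, analysed and optimised logical plans as well as the physical plan; the DataFrame here is a made-up example.

```scala
import org.apache.spark.sql.SparkSession

object CatalystExplainExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("catalyst-demo").master("local[*]").getOrCreate()
    import spark.implicits._

    val df = Seq(("Alice", 34), ("Bob", 23), ("Carol", 41)).toDF("name", "age")

    // explain(true) prints the logical plans before and after Catalyst optimisation,
    // followed by the physical plan chosen for execution.
    df.filter($"age" > 30).select($"name").explain(true)

    spark.stop()
  }
}
```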
