Q1. Difference between HashMap and ConcurrentHashMap?
👉 HashMap is not thread-safe; concurrent modification can corrupt its internal structure or throw ConcurrentModificationException during iteration.
👉 ConcurrentHashMap is thread-safe: pre-Java 8 it used segment (stripe) locking, and since Java 8 it uses CAS plus per-bucket synchronized blocks, allowing fully concurrent reads and fine-grained writes.
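
A minimal sketch of the difference under concurrent writers (class name and thread/iteration counts are illustrative):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class MapContention {
    public static void main(String[] args) throws InterruptedException {
        // A plain HashMap here could lose updates or corrupt its internal table;
        // ConcurrentHashMap tolerates concurrent writers.
        Map<String, Integer> counts = new ConcurrentHashMap<>();

        Runnable task = () -> {
            for (int i = 0; i < 10_000; i++) {
                counts.merge("hits", 1, Integer::sum); // atomic per-key update
            }
        };

        Thread t1 = new Thread(task);
        Thread t2 = new Thread(task);
        t1.start(); t2.start();
        t1.join(); t2.join();

        System.out.println(counts.get("hits")); // reliably 20000 with ConcurrentHashMap
    }
}
```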

Q2. How do you avoid deadlocks in Java?
👉 Strategies:

·        Lock ordering (always acquire locks in a fixed sequence).

·        Use tryLock with a timeout (see the sketch after this list).

·        Minimize lock scope.

·        Prefer concurrent data structures over manual synchronization.
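
A minimal sketch of the tryLock-with-timeout strategy from the list above (class, lock names and the 1-second timeout are illustrative):

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class TransferService {
    private final ReentrantLock lockA = new ReentrantLock();
    private final ReentrantLock lockB = new ReentrantLock();

    // Acquire both locks with a timeout and back off instead of blocking forever.
    public boolean transfer() throws InterruptedException {
        if (lockA.tryLock(1, TimeUnit.SECONDS)) {
            try {
                if (lockB.tryLock(1, TimeUnit.SECONDS)) {
                    try {
                        // ... critical section touching both resources ...
                        return true;
                    } finally {
                        lockB.unlock();
                    }
                }
            } finally {
                lockA.unlock();
            }
        }
        return false; // caller can retry, ideally after a small random delay
    }
}
```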

Q3. Difference between parallelStream() and stream()?
👉 stream() is sequential and runs on the calling thread.
👉 parallelStream() uses the common ForkJoinPool and splits the workload across multiple threads.
⚠️ Best for CPU-bound tasks, not for I/O-heavy workloads.
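
A quick illustration (the list size is illustrative; both calls produce the same result, only the execution model differs):

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class StreamDemo {
    public static void main(String[] args) {
        List<Integer> numbers = IntStream.rangeClosed(1, 1_000).boxed()
                .collect(Collectors.toList());

        // Sequential: runs on the calling thread.
        long evens = numbers.stream().filter(n -> n % 2 == 0).count();

        // Parallel: work is split across the common ForkJoinPool.
        long evensParallel = numbers.parallelStream().filter(n -> n % 2 == 0).count();

        System.out.println(evens + " / " + evensParallel); // 500 / 500
    }
}
```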

Q4. How do you implement retry with exponential backoff in Java?
👉 Use ScheduledExecutorService or libraries like Resilience4j. Example: retry after 1s, 2s, 4s, … until max retries.
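
A minimal hand-rolled sketch of that loop (Resilience4j's Retry module gives you the same with jitter and metrics built in):

```java
import java.util.concurrent.Callable;

public class RetryWithBackoff {
    // Retries the task with delays of 1s, 2s, 4s, ... until maxRetries is reached.
    public static <T> T retry(Callable<T> task, int maxRetries) throws Exception {
        long delayMs = 1_000;
        for (int attempt = 1; ; attempt++) {
            try {
                return task.call();
            } catch (Exception e) {
                if (attempt >= maxRetries) {
                    throw e; // retries exhausted, propagate the failure
                }
                Thread.sleep(delayMs);
                delayMs *= 2; // exponential backoff (add jitter in production)
            }
        }
    }
}
```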

Q5. REST API idempotency – how to design PUT vs POST?
👉 POST: create (non-idempotent).
👉 PUT: create/update (idempotent – same request multiple times = same result).
👉 Use Idempotency-Key header for banking/payment APIs.
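
A minimal server-side sketch of the Idempotency-Key check (the in-memory store, class and method names are illustrative – a real payment service would use a shared store with a TTL):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class PaymentService {
    // idempotency key -> response of the first successful call
    private final Map<String, String> processed = new ConcurrentHashMap<>();

    public String createPayment(String idempotencyKey, String request) {
        // computeIfAbsent runs the side effect at most once per key;
        // a replay with the same key just returns the original response.
        return processed.computeIfAbsent(idempotencyKey, key -> executePayment(request));
    }

    private String executePayment(String request) {
        return "payment-id-123"; // placeholder for the real payment call
    }
}
```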

Section 2 – Spark

Q6. RDD vs DataFrame vs Dataset?
👉 RDD: low-level API; typed, but no Catalyst/Tungsten optimization.
👉 DataFrame: high-level API (Dataset<Row>), optimized by Catalyst, but untyped – column errors surface only at runtime.
👉 Dataset: combines RDD-style compile-time type safety with DataFrame optimizations.
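
A short Java sketch of the same data as an untyped DataFrame vs a typed Dataset (the people.json path and the Person bean are illustrative):

```java
import org.apache.spark.api.java.function.FilterFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class DataFrameVsDataset {
    public static class Person implements java.io.Serializable {
        private String name;
        private long age;
        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
        public long getAge() { return age; }
        public void setAge(long age) { this.age = age; }
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("df-vs-ds").master("local[*]").getOrCreate();

        // DataFrame = Dataset<Row>: columns checked at runtime, Catalyst-optimized.
        Dataset<Row> df = spark.read().json("people.json");

        // Dataset<Person>: compile-time typing on top of the same optimizer.
        Dataset<Person> people = df.as(Encoders.bean(Person.class));
        people.filter((FilterFunction<Person>) p -> p.getAge() > 30).show();

        spark.stop();
    }
}
```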

Q7. How do you handle skewed data in Spark?
👉 Techniques:

·        Salting keys (prefix hot keys with a random value – see the sketch after this list).

·        Repartitioning/shuffle partition tuning.

·        Use broadcast joins for small tables.

·        Skew join optimization in Spark SQL.
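
A short sketch of salting a hot key before an aggregation or join (the salt count of 10 and the column names are illustrative; the other side of a join must be expanded with the same salt values):

```java
import static org.apache.spark.sql.functions.*;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class SaltingExample {
    // Spreads a hot key across 10 sub-keys so a single partition
    // no longer receives all of its rows during the shuffle.
    public static Dataset<Row> saltKeys(Dataset<Row> df) {
        return df.withColumn("salt", floor(rand().multiply(10)))
                 .withColumn("salted_key", concat_ws("_", col("customer_id"), col("salt")));
    }
}
```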

Q8. What is checkpointing in Spark Streaming?
👉 Saves state + metadata to HDFS (or other reliable storage).
👉 Used for recovery after failures.
👉 Two types: Metadata checkpointing, Data checkpointing.
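
A minimal Structured Streaming sketch with a checkpoint location (broker, topic and paths are illustrative, and the Kafka source dependency is assumed to be on the classpath). With the older DStream API the equivalent call is streamingContext.checkpoint("hdfs:///checkpoints/app").

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;

public class CheckpointExample {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder().appName("checkpoint-demo").getOrCreate();

        Dataset<Row> events = spark.readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "broker:9092")
                .option("subscribe", "transactions")
                .load();

        // Offsets and state are persisted here, so the query resumes after a failure.
        StreamingQuery query = events.writeStream()
                .format("parquet")
                .option("path", "hdfs:///data/transactions")
                .option("checkpointLocation", "hdfs:///checkpoints/transactions")
                .start();

        query.awaitTermination();
    }
}
```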

Q9. How do you achieve fault tolerance in Spark Streaming?
👉 Input data must be replayable (Kafka/Flume).
👉 Spark uses lineage + checkpointing to recompute lost partitions.

Q10. Spark job optimization techniques?

·        Cache/persist intermediate results.

·        Minimize shuffles (prefer reduceByKey over groupByKey, use mapPartitions).

·        Use partition pruning, predicate pushdown.

·        Avoid UDFs when SQL functions exist.
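
Two of these techniques in Java (RDD names are illustrative):

```java
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.storage.StorageLevel;
import scala.Tuple2;

public class OptimizationSnippets {
    public static JavaPairRDD<String, Integer> wordCounts(JavaRDD<String> words) {
        // reduceByKey combines values on the map side before the shuffle,
        // unlike groupByKey which ships every record across the network.
        JavaPairRDD<String, Integer> counts = words
                .mapToPair(w -> new Tuple2<>(w, 1))
                .reduceByKey((a, b) -> a + b);

        // Persist because the result feeds several downstream actions;
        // spills to disk instead of recomputing the lineage.
        counts.persist(StorageLevel.MEMORY_AND_DISK());
        return counts;
    }
}
```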

Section 3 – Hadoop Ecosystem

Q11. HDFS block size and why large?
👉 Default is 128 MB (often raised to 256 MB). Larger blocks mean fewer seeks and less NameNode metadata, giving better throughput for large sequential reads.

Q12. NameNode vs DataNode?
👉 NameNode: stores the filesystem namespace and block metadata.
👉 DataNode: stores the actual data blocks and reports to the NameNode via heartbeats/block reports.

Q13. Hive – external vs managed table?
👉 Managed: Hive controls lifecycle (drop = delete data).
👉 External: Hive stores only the metadata; the underlying data remains even if the table is dropped.

Q14. ORC vs Parquet?
👉 ORC: optimized for Hive, good compression, lightweight indexes (min/max stats, bloom filters).
👉 Parquet: language-agnostic, nested data support, better for Spark.

Q15. HBase row-key design – best practices?
👉 Short, unique, evenly distributed (avoid hotspotting).
👉 Reverse timestamps for time-series data so the newest rows sort first.
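
A tiny sketch of such a row key (the bucket count, separator and field layout are illustrative):

```java
public class RowKeys {
    // Hashing the id into a bucket prefix spreads writes across regions;
    // reversing the timestamp makes the newest events sort first.
    public static String timeSeriesKey(String deviceId, long epochMillis) {
        int bucket = Math.abs(deviceId.hashCode() % 16);
        long reversedTs = Long.MAX_VALUE - epochMillis;
        return bucket + "_" + deviceId + "_" + reversedTs;
    }
}
```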

Section 4 – Messaging Systems

Q16. Kafka vs RabbitMQ difference?
👉 Kafka: distributed log, high throughput, partitioned, replay capability.
👉 RabbitMQ: traditional broker, push-based, supports complex routing.

Q17. How does Kafka ensure ordering?
👉 Messages are ordered within a partition.
👉 To guarantee ordering, give all related messages the same key so they land in the same partition.
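
A short producer sketch showing keyed sends (broker, topic and key are illustrative):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class KeyedProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Same key -> same partition -> ordering preserved for this account.
            producer.send(new ProducerRecord<>("payments", "account-42", "debit 100"));
            producer.send(new ProducerRecord<>("payments", "account-42", "credit 50"));
        }
    }
}
```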

Q18. How do you handle offset management in Kafka?
👉 Auto commit vs manual commit.
👉 Best practice: commit offset after processing the message.
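
A minimal consumer sketch with manual commits after processing (broker, group and topic are illustrative):

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ManualCommitConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker:9092");
        props.put("group.id", "payments-processor");
        props.put("enable.auto.commit", "false"); // we decide when the offset moves
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("payments"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    process(record.value()); // business logic first
                }
                consumer.commitSync(); // then commit, so a crash replays at most one batch
            }
        }
    }

    private static void process(String value) { /* ... */ }
}
```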

Q19. Kafka “exactly once semantics” – how?
👉 Idempotent producer (enable.idempotence=true).
👉 Transactions API for producer-consumer pipeline.
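
A minimal transactional-producer sketch (ids, broker and topic are illustrative; in a read-process-write pipeline you would also call sendOffsetsToTransaction before committing):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ExactlyOnceProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker:9092");
        props.put("enable.idempotence", "true");          // broker de-duplicates retries
        props.put("transactional.id", "payments-tx-1");   // enables the transactions API
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.initTransactions();
            producer.beginTransaction();
            producer.send(new ProducerRecord<>("payments", "account-42", "debit 100"));
            producer.commitTransaction(); // all-or-nothing for everything in the transaction
        }
    }
}
```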

Q20. What if Kafka broker goes down?
👉 The controller elects a new leader from the ISR (in-sync replicas), so a replicated partition stays available.
👉 Producers/consumers retry and rediscover the new leader automatically.
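
A short sketch of producer settings that ride out a leader failover (values are illustrative; on the topic side, replication.factor=3 with min.insync.replicas=2 is a common pairing):

```java
import java.util.Properties;

public class ResilientProducerConfig {
    public static Properties build() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092,broker2:9092,broker3:9092");
        props.put("acks", "all");                                // wait for the in-sync replicas
        props.put("retries", String.valueOf(Integer.MAX_VALUE)); // keep retrying through leader election
        props.put("enable.idempotence", "true");                 // retries will not create duplicates
        return props;
    }
}
```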

Section 5 – DevOps / CI-CD / TDD

Q21. Maven vs SBT vs Ant?
👉 Maven: convention over configuration, XML based.
👉 SBT: Scala-based, better for Spark/Scala projects.
👉 Ant: older, procedural, no built-in dependency management (needs Ivy).

Q22. Jenkins pipeline stages for Spark job?

1.   Checkout code (Git).

2.   Compile & run unit tests.

3.   Build JAR with Maven.

4.   Run integration tests on staging cluster.

5.   Deploy to production (submit via spark-submit).

Q23. Git – difference between merge and rebase?
👉 Merge: keeps branch history.
👉 Rebase: replays commits onto the new base for a linear history (cleaner, but never rebase shared/public branches).

Q24. JUnit 5 improvements over JUnit 4?
👉 @ParameterizedTest, @Nested, @DisplayName, better assertions.
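
A quick example of the new annotations (the test class and values are illustrative):

```java
import static org.junit.jupiter.api.Assertions.assertTrue;

import org.junit.jupiter.api.DisplayName;
import org.junit.jupiter.params.ParameterizedTest;
import org.junit.jupiter.params.provider.ValueSource;

class AmountValidatorTest {
    @ParameterizedTest
    @ValueSource(ints = {1, 100, 10_000})
    @DisplayName("positive amounts are accepted")
    void acceptsPositiveAmounts(int amount) {
        assertTrue(amount > 0); // stand-in for the real validation logic
    }
}
```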

Q25. How do you ensure test coverage for streaming apps?
👉 Use embedded Kafka for testing.
👉 Use MemoryStream in Spark Structured Streaming.
👉 Validate outputs against expected results.

Section 6 – Scenario/Case Studies

Q26. A Spark Streaming job is lagging behind real-time. What do you check?

·        Check Kafka consumer lag (kafka-consumer-groups.sh).

·        Check batch interval & processing time.

·        Increase parallelism (more Kafka partitions / executor cores) and cap or adapt the ingest rate with backpressure – see the sketch after this list.

·        Optimize transformations (reduce shuffles).
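
A short sketch of the rate-limiting/backpressure settings for a DStream job (app name and values are illustrative):

```java
import org.apache.spark.SparkConf;

public class StreamingTuning {
    public static SparkConf conf() {
        return new SparkConf()
                .setAppName("payments-stream")
                // Let Spark adapt the ingest rate to the observed processing speed.
                .set("spark.streaming.backpressure.enabled", "true")
                // Hard cap per Kafka partition per second, so a spike cannot flood one batch.
                .set("spark.streaming.kafka.maxRatePerPartition", "1000");
    }
}
```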

Q27. You get OutOfMemoryError in Spark. How to debug?

·        Increase executor memory.

·        Enable Kryo serialization.

·        Use persist(StorageLevel.DISK_ONLY) for large datasets.

·        Avoid wide transformations.
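
The usual knobs in code form (sizes are illustrative; the right values depend on the cluster and data volume):

```java
import org.apache.spark.SparkConf;

public class MemoryTuning {
    public static SparkConf conf() {
        return new SparkConf()
                .set("spark.executor.memory", "8g")
                .set("spark.executor.memoryOverhead", "1g")
                // Kryo is far more compact than Java serialization for shuffle/cache data.
                .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
        // For very large intermediate datasets: rdd.persist(StorageLevel.DISK_ONLY())
    }
}
```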

Q28. You have 1TB JSON logs in HDFS, query is slow. How to improve?
👉 Convert JSON to ORC/Parquet (columnar, compressed).
👉 Partition by date/time.
👉 Use vectorized queries in Hive/Spark.
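
A one-off conversion job sketch (paths and the partition column are illustrative):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class JsonToParquet {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("json-to-parquet").getOrCreate();

        Dataset<Row> logs = spark.read().json("hdfs:///logs/raw/*.json");

        // Columnar, compressed, and partition-pruned on event_date.
        logs.write()
            .mode(SaveMode.Overwrite)
            .partitionBy("event_date")
            .parquet("hdfs:///logs/parquet");

        spark.stop();
    }
}
```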

Q29. Kafka consumer is reprocessing same messages after restart. Why?
👉 Offsets were not committed properly (e.g., auto-commit disabled with no manual commit, or a crash between processing and commit).
👉 Solution: commit offsets only after the message has been processed, and make processing idempotent so replays are harmless.

Q30. Spark job with shuffle stage takes too long. How to optimize?
👉 Tune the number of shuffle partitions (spark.sql.shuffle.partitions).
👉 Use map-side combine.
👉 Avoid skew by salting.

Section 7 – Behavioral / Banking Domain

Q31. Tell me about a time you optimized a big data pipeline.
👉 Example: reduced Spark shuffle time by tuning partitions and using broadcast joins, improving the SLA by 40%.

Q32. How do you ensure data security in Big Data pipelines?
👉 Encrypt data at rest (HDFS TDE) & in transit (TLS).
👉 Mask PII fields before ingestion.
👉 Use Ranger/Atlas for access control & lineage.

Q33. How do you ensure regulatory compliance (GDPR/CCPA)?
👉 Right-to-erasure implementation (delete user data from HDFS/Hive).
👉 Audit logs for all access.

Quick Tips Before Monjin Video Interview

·        Monjin = scenario-based + coding; expect to share your screen and code live (Java + Spark + Kafka basics).

·        Keep 2–3 real project examples (data pipeline, Kafka ingestion, Spark optimization).

·        Stress performance tuning, scalability, and resilience – Synechron's banking clients need this.

·        Keep buzzwords handy: "idempotency", "backpressure handling", "predicate pushdown", "exactly-once semantics".