Q1. Difference between HashMap and ConcurrentHashMap?
👉 HashMap is not thread-safe; concurrent modification can corrupt its internal structure or throw ConcurrentModificationException during iteration.
👉 ConcurrentHashMap is thread-safe: pre-Java 8 it used segment (stripe) locking, and since Java 8 it uses CAS plus per-bucket synchronized blocks, allowing fully concurrent reads and fine-grained writes.
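
A minimal sketch of the difference under concurrent writers (class name and thread/iteration counts are illustrative):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class MapContention {
    public static void main(String[] args) throws InterruptedException {
        // A plain HashMap here could lose updates or corrupt its internal table;
        // ConcurrentHashMap tolerates concurrent writers.
        Map<String, Integer> counts = new ConcurrentHashMap<>();

        Runnable task = () -> {
            for (int i = 0; i < 10_000; i++) {
                counts.merge("hits", 1, Integer::sum); // atomic per-key update
            }
        };

        Thread t1 = new Thread(task);
        Thread t2 = new Thread(task);
        t1.start(); t2.start();
        t1.join(); t2.join();

        System.out.println(counts.get("hits")); // reliably 20000 with ConcurrentHashMap
    }
}
```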

Q2. How do you avoid deadlocks in Java?
👉 Strategies:

·        Lock ordering (always acquire locks in a fixed sequence).

·        Use tryLock with a timeout (see the sketch after this list).

·        Minimize lock scope.

·        Prefer concurrent data structures over manual synchronization.
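
A minimal sketch of the tryLock-with-timeout strategy from the list above (class, lock names and the 1-second timeout are illustrative):

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class TransferService {
    private final ReentrantLock lockA = new ReentrantLock();
    private final ReentrantLock lockB = new ReentrantLock();

    // Acquire both locks with a timeout and back off instead of blocking forever.
    public boolean transfer() throws InterruptedException {
        if (lockA.tryLock(1, TimeUnit.SECONDS)) {
            try {
                if (lockB.tryLock(1, TimeUnit.SECONDS)) {
                    try {
                        // ... critical section touching both resources ...
                        return true;
                    } finally {
                        lockB.unlock();
                    }
                }
            } finally {
                lockA.unlock();
            }
        }
        return false; // caller can retry, ideally after a small random delay
    }
}
```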

Q3. Difference between parallelStream() and stream()?
👉 stream() is sequential and runs on the calling thread.
👉 parallelStream() uses the common ForkJoinPool and splits the workload across multiple threads.
⚠️ Best for CPU-bound tasks, not for I/O-heavy workloads.
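
A quick illustration (the list size is illustrative; both calls produce the same result, only the execution model differs):

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class StreamDemo {
    public static void main(String[] args) {
        List<Integer> numbers = IntStream.rangeClosed(1, 1_000).boxed()
                .collect(Collectors.toList());

        // Sequential: runs on the calling thread.
        long evens = numbers.stream().filter(n -> n % 2 == 0).count();

        // Parallel: work is split across the common ForkJoinPool.
        long evensParallel = numbers.parallelStream().filter(n -> n % 2 == 0).count();

        System.out.println(evens + " / " + evensParallel); // 500 / 500
    }
}
```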

Q4. How do you implement retry with exponential backoff in Java?
👉 Use ScheduledExecutorService or libraries like Resilience4j. Example: retry after 1s, 2s, 4s, … until max retries.
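
A minimal hand-rolled sketch of that loop (Resilience4j's Retry module gives you the same with jitter and metrics built in):

```java
import java.util.concurrent.Callable;

public class RetryWithBackoff {
    // Retries the task with delays of 1s, 2s, 4s, ... until maxRetries is reached.
    public static <T> T retry(Callable<T> task, int maxRetries) throws Exception {
        long delayMs = 1_000;
        for (int attempt = 1; ; attempt++) {
            try {
                return task.call();
            } catch (Exception e) {
                if (attempt >= maxRetries) {
                    throw e; // retries exhausted, propagate the failure
                }
                Thread.sleep(delayMs);
                delayMs *= 2; // exponential backoff (add jitter in production)
            }
        }
    }
}
```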

Q5. REST API idempotency – how to design PUT vs POST?
👉 POST: create (non-idempotent).
👉 PUT: create/update (idempotent – same request multiple times = same result).
👉 Use Idempotency-Key header for banking/payment APIs.
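
A minimal server-side sketch of the Idempotency-Key check (the in-memory store, class and method names are illustrative – a real payment service would use a shared store with a TTL):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class PaymentService {
    // idempotency key -> response of the first successful call
    private final Map<String, String> processed = new ConcurrentHashMap<>();

    public String createPayment(String idempotencyKey, String request) {
        // computeIfAbsent runs the side effect at most once per key;
        // a replay with the same key just returns the original response.
        return processed.computeIfAbsent(idempotencyKey, key -> executePayment(request));
    }

    private String executePayment(String request) {
        return "payment-id-123"; // placeholder for the real payment call
    }
}
```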

Section 2 – Spark

Q6. RDD vs DataFrame vs Dataset?
👉 RDD: low-level API; typed, but no Catalyst/Tungsten optimization.
👉 DataFrame: high-level API (Dataset<Row>), optimized by Catalyst, but untyped – column errors surface only at runtime.
👉 Dataset: combines RDD-style compile-time type safety with DataFrame optimizations.
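
A short Java sketch of the same data as an untyped DataFrame vs a typed Dataset (the people.json path and the Person bean are illustrative):

```java
import org.apache.spark.api.java.function.FilterFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class DataFrameVsDataset {
    public static class Person implements java.io.Serializable {
        private String name;
        private long age;
        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
        public long getAge() { return age; }
        public void setAge(long age) { this.age = age; }
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("df-vs-ds").master("local[*]").getOrCreate();

        // DataFrame = Dataset<Row>: columns checked at runtime, Catalyst-optimized.
        Dataset<Row> df = spark.read().json("people.json");

        // Dataset<Person>: compile-time typing on top of the same optimizer.
        Dataset<Person> people = df.as(Encoders.bean(Person.class));
        people.filter((FilterFunction<Person>) p -> p.getAge() > 30).show();

        spark.stop();
    }
}
```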

Q7. How do you handle skewed data in Spark?
👉 Techniques:

·        Salting keys (prefix hot keys with a random value – see the sketch after this list).

·        Repartitioning/shuffle partition tuning.

·        Use broadcast joins for small tables.

·        Skew join optimization in Spark SQL.
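
A short sketch of salting a hot key before an aggregation or join (the salt count of 10 and the column names are illustrative; the other side of a join must be expanded with the same salt values):

```java
import static org.apache.spark.sql.functions.*;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class SaltingExample {
    // Spreads a hot key across 10 sub-keys so a single partition
    // no longer receives all of its rows during the shuffle.
    public static Dataset<Row> saltKeys(Dataset<Row> df) {
        return df.withColumn("salt", floor(rand().multiply(10)))
                 .withColumn("salted_key", concat_ws("_", col("customer_id"), col("salt")));
    }
}
```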

Q8. What is checkpointing in Spark Streaming?
👉 Saves state + metadata to HDFS (or other reliable storage).
👉 Used for recovery after failures.
👉 Two types: Metadata checkpointing, Data checkpointing.
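
A minimal Structured Streaming sketch with a checkpoint location (broker, topic and paths are illustrative, and the Kafka source dependency is assumed to be on the classpath). With the older DStream API the equivalent call is streamingContext.checkpoint("hdfs:///checkpoints/app").

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;

public class CheckpointExample {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder().appName("checkpoint-demo").getOrCreate();

        Dataset<Row> events = spark.readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "broker:9092")
                .option("subscribe", "transactions")
                .load();

        // Offsets and state are persisted here, so the query resumes after a failure.
        StreamingQuery query = events.writeStream()
                .format("parquet")
                .option("path", "hdfs:///data/transactions")
                .option("checkpointLocation", "hdfs:///checkpoints/transactions")
                .start();

        query.awaitTermination();
    }
}
```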

Q9. How do you achieve fault tolerance in Spark Streaming?
👉 Input data must be replayable (Kafka/Flume).
👉 Spark uses lineage + checkpointing to recompute lost partitions.

Q10. Spark job optimization techniques?

·        Cache/persist intermediate results.

·        Minimize shuffles (prefer reduceByKey over groupByKey, use mapPartitions).

·        Use partition pruning, predicate pushdown.

·        Avoid UDFs when SQL functions exist.
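
Two of these techniques in Java (RDD names are illustrative):

```java
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.storage.StorageLevel;
import scala.Tuple2;

public class OptimizationSnippets {
    public static JavaPairRDD<String, Integer> wordCounts(JavaRDD<String> words) {
        // reduceByKey combines values on the map side before the shuffle,
        // unlike groupByKey which ships every record across the network.
        JavaPairRDD<String, Integer> counts = words
                .mapToPair(w -> new Tuple2<>(w, 1))
                .reduceByKey((a, b) -> a + b);

        // Persist because the result feeds several downstream actions;
        // spills to disk instead of recomputing the lineage.
        counts.persist(StorageLevel.MEMORY_AND_DISK());
        return counts;
    }
}
```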

Section 3 – Hadoop Ecosystem

Q11. HDFS block size and why large?
👉 Default is 128 MB (often raised to 256 MB). Larger blocks mean fewer seeks and less NameNode metadata, giving better throughput for large sequential reads.

Q12. NameNode vs DataNode?
👉 NameNode: stores the filesystem namespace and block metadata.
👉 DataNode: stores the actual data blocks and reports to the NameNode via heartbeats/block reports.

Q13. Hive – external vs managed table?
👉 Managed: Hive controls lifecycle (drop = delete data).
👉 External: Hive stores only the metadata; the underlying data remains even if the table is dropped.

Q14. ORC vs Parquet?
👉 ORC: optimized for Hive, good compression, lightweight indexes (min/max stats, bloom filters).
👉 Parquet: language-agnostic, nested data support, better for Spark.

Q15. HBase row-key design – best practices?
👉 Short, unique, evenly distributed (avoid hotspotting).
👉 Reverse timestamps for time-series data so the newest rows sort first.
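
A tiny sketch of such a row key (the bucket count, separator and field layout are illustrative):

```java
public class RowKeys {
    // Hashing the id into a bucket prefix spreads writes across regions;
    // reversing the timestamp makes the newest events sort first.
    public static String timeSeriesKey(String deviceId, long epochMillis) {
        int bucket = Math.abs(deviceId.hashCode() % 16);
        long reversedTs = Long.MAX_VALUE - epochMillis;
        return bucket + "_" + deviceId + "_" + reversedTs;
    }
}
```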

Section 4 – Messaging Systems

Q16. Kafka vs RabbitMQ difference?
👉 Kafka: distributed log, high throughput, partitioned, replay capability.
👉 RabbitMQ: traditional broker, push-based, supports complex routing.

Q17. How does Kafka ensure ordering?
👉 Messages are ordered within a partition.
👉 To guarantee ordering, give all related messages the same key so they land in the same partition.
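
A short producer sketch showing keyed sends (broker, topic and key are illustrative):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class KeyedProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Same key -> same partition -> ordering preserved for this account.
            producer.send(new ProducerRecord<>("payments", "account-42", "debit 100"));
            producer.send(new ProducerRecord<>("payments", "account-42", "credit 50"));
        }
    }
}
```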

Q18. How do you handle offset management in Kafka?
👉 Auto commit vs manual commit.
👉 Best practice: commit offset after processing the message.
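
A minimal consumer sketch with manual commits after processing (broker, group and topic are illustrative):

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ManualCommitConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker:9092");
        props.put("group.id", "payments-processor");
        props.put("enable.auto.commit", "false"); // we decide when the offset moves
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("payments"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    process(record.value()); // business logic first
                }
                consumer.commitSync(); // then commit, so a crash replays at most one batch
            }
        }
    }

    private static void process(String value) { /* ... */ }
}
```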

Q19. Kafka “exactly once semantics” – how?
👉 Idempotent producer (enable.idempotence=true).
👉 Transactions API for producer-consumer pipeline.
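
A minimal transactional-producer sketch (ids, broker and topic are illustrative; in a read-process-write pipeline you would also call sendOffsetsToTransaction before committing):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ExactlyOnceProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker:9092");
        props.put("enable.idempotence", "true");          // broker de-duplicates retries
        props.put("transactional.id", "payments-tx-1");   // enables the transactions API
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.initTransactions();
            producer.beginTransaction();
            producer.send(new ProducerRecord<>("payments", "account-42", "debit 100"));
            producer.commitTransaction(); // all-or-nothing for everything in the transaction
        }
    }
}
```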

Q20. What if Kafka broker goes down?
👉 The controller elects a new leader from the ISR (in-sync replicas), so a replicated partition stays available.
👉 Producers/consumers retry and rediscover the new leader automatically.
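
A short sketch of producer settings that ride out a leader failover (values are illustrative; on the topic side, replication.factor=3 with min.insync.replicas=2 is a common pairing):

```java
import java.util.Properties;

public class ResilientProducerConfig {
    public static Properties build() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092,broker2:9092,broker3:9092");
        props.put("acks", "all");                                // wait for the in-sync replicas
        props.put("retries", String.valueOf(Integer.MAX_VALUE)); // keep retrying through leader election
        props.put("enable.idempotence", "true");                 // retries will not create duplicates
        return props;
    }
}
```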

Section 5 – DevOps / CI-CD / TDD

Q21. Maven vs SBT vs Ant?
👉 Maven: convention over configuration, XML based.
👉 SBT: Scala-based, better for Spark/Scala projects.
👉 Ant: older, procedural, no built-in dependency management (needs Ivy).

Q22. Jenkins pipeline stages for Spark job?

1.   Checkout code (Git).

2.   Compile & run unit tests.

3.   Build JAR with Maven.

4.   Run integration tests on staging cluster.

5.   Deploy to production (submit via spark-submit).

Q23. Git – difference between merge and rebase?
👉 Merge: keeps branch history.
👉 Rebase: replays commits onto the new base for a linear history (cleaner, but never rebase shared/public branches).

Q24. JUnit 5 improvements over JUnit 4?
👉 @ParameterizedTest, @Nested, @DisplayName, better assertions.
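
A quick example of the new annotations (the test class and values are illustrative):

```java
import static org.junit.jupiter.api.Assertions.assertTrue;

import org.junit.jupiter.api.DisplayName;
import org.junit.jupiter.params.ParameterizedTest;
import org.junit.jupiter.params.provider.ValueSource;

class AmountValidatorTest {
    @ParameterizedTest
    @ValueSource(ints = {1, 100, 10_000})
    @DisplayName("positive amounts are accepted")
    void acceptsPositiveAmounts(int amount) {
        assertTrue(amount > 0); // stand-in for the real validation logic
    }
}
```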

Q25. How do you ensure test coverage for streaming apps?
👉 Use embedded Kafka for testing.
👉 Use MemoryStream in Spark Structured Streaming.
👉 Validate outputs against expected results.

Section 6 – Scenario/Case Studies

Q26. A Spark Streaming job is lagging behind real-time. What do you check?

·        Check Kafka consumer lag (kafka-consumer-groups.sh).

·        Check batch interval & processing time.

·        Increase parallelism (more Kafka partitions / executor cores) and cap or adapt the ingest rate with backpressure – see the sketch after this list.

·        Optimize transformations (reduce shuffles).
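
A short sketch of the rate-limiting/backpressure settings for a DStream job (app name and values are illustrative):

```java
import org.apache.spark.SparkConf;

public class StreamingTuning {
    public static SparkConf conf() {
        return new SparkConf()
                .setAppName("payments-stream")
                // Let Spark adapt the ingest rate to the observed processing speed.
                .set("spark.streaming.backpressure.enabled", "true")
                // Hard cap per Kafka partition per second, so a spike cannot flood one batch.
                .set("spark.streaming.kafka.maxRatePerPartition", "1000");
    }
}
```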

Q27. You get OutOfMemoryError in Spark. How to debug?

·        Increase executor memory.

·        Enable Kryo serialization.

·        Use persist(StorageLevel.DISK_ONLY) for large datasets.

·        Avoid wide transformations.
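
The usual knobs in code form (sizes are illustrative; the right values depend on the cluster and data volume):

```java
import org.apache.spark.SparkConf;

public class MemoryTuning {
    public static SparkConf conf() {
        return new SparkConf()
                .set("spark.executor.memory", "8g")
                .set("spark.executor.memoryOverhead", "1g")
                // Kryo is far more compact than Java serialization for shuffle/cache data.
                .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
        // For very large intermediate datasets: rdd.persist(StorageLevel.DISK_ONLY())
    }
}
```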

Q28. You have 1TB JSON logs in HDFS, query is slow. How to improve?
👉 Convert JSON to ORC/Parquet (columnar, compressed).
👉 Partition by date/time.
👉 Use vectorized queries in Hive/Spark.
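
A one-off conversion job sketch (paths and the partition column are illustrative):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class JsonToParquet {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("json-to-parquet").getOrCreate();

        Dataset<Row> logs = spark.read().json("hdfs:///logs/raw/*.json");

        // Columnar, compressed, and partition-pruned on event_date.
        logs.write()
            .mode(SaveMode.Overwrite)
            .partitionBy("event_date")
            .parquet("hdfs:///logs/parquet");

        spark.stop();
    }
}
```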

Q29. Kafka consumer is reprocessing same messages after restart. Why?
👉 Offsets were not committed properly (e.g., auto-commit disabled with no manual commit, or a crash between processing and commit).
👉 Solution: commit offsets only after the message has been processed, and make processing idempotent so replays are harmless.

Q30. Spark job with shuffle stage takes too long. How to optimize?
👉 Tune the number of shuffle partitions (spark.sql.shuffle.partitions).
👉 Use map-side combine.
👉 Avoid skew by salting.

Section 7 – Behavioral / Banking Domain

Q31. Tell me about a time you optimized a big data pipeline.
👉 Example: reduced Spark shuffle time by tuning partitions and using broadcast joins, improving the SLA by 40%.

Q32. How do you ensure data security in Big Data pipelines?
👉 Encrypt data at rest (HDFS TDE) & in transit (TLS).
👉 Mask PII fields before ingestion.
👉 Use Ranger/Atlas for access control & lineage.

Q33. How do you ensure regulatory compliance (GDPR/CCPA)?
👉 Right-to-erasure implementation (delete user data from HDFS/Hive).
👉 Audit logs for all access.

Quick Tips Before Monjin Video Interview

·        Monjin = scenario-based + coding; expect to share your screen and code live (Java + Spark + Kafka basics).

·        Keep 2–3 real project examples (data pipeline, Kafka ingestion, Spark optimization).

·        Stress performance tuning, scalability, and resilience – Synechron's banking clients need this.

·        Keep buzzwords handy: "idempotency", "backpressure handling", "predicate pushdown", "exactly-once semantics".