ML Fundamentals
1. What is the bias-variance tradeoff?
Answer: Bias is error from overly simplistic assumptions (underfitting). Variance is error from sensitivity to training data fluctuations (overfitting). The tradeoff: increasing model complexity reduces bias but increases variance. The goal is finding the sweet spot that minimizes total error. Regularization (L1/L2), cross-validation, and ensemble methods help manage this balance.
2. Explain the difference between L1 and L2 regularization.
Answer: L1 (Lasso) adds the absolute value of weights to the loss function, encouraging sparsity — some weights become exactly zero, performing feature selection. L2 (Ridge) adds the squared weights, shrinking all weights toward zero without eliminating any. L1 is better when you suspect many irrelevant features; L2 is better when all features contribute.
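A minimal NumPy sketch of the sparsity difference, using proximal gradient descent on a toy regression where only two of five features matter (the data, λ, and learning rate are illustrative choices, not from any particular library):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
w_true = np.array([3.0, -2.0, 0.0, 0.0, 0.0])   # only 2 relevant features
y = X @ w_true + 0.1 * rng.normal(size=n)

def fit(X, y, penalty, lam=0.5, lr=0.01, steps=2000):
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(y)        # MSE gradient
        w -= lr * grad
        if penalty == "l2":                       # ridge: shrink all weights toward zero
            w -= lr * lam * w
        else:                                     # lasso: soft-threshold (proximal step)
            w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)
    return w

w_l1 = fit(X, y, "l1")
w_l2 = fit(X, y, "l2")
# w_l1: irrelevant weights are driven exactly to zero (feature selection)
# w_l2: all weights shrunk, none exactly zero
```

The soft-thresholding step is what makes L1 produce exact zeros; L2's shrinkage never quite reaches zero.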
3. What is cross-validation and why is it important?
Answer: Cross-validation splits data into K folds, training on K-1 folds and validating on the remaining one, rotating K times. It provides a more reliable estimate of model performance than a single train/test split, reduces the chance of lucky/unlucky splits, and helps detect overfitting. 5-fold or 10-fold CV is standard.
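The rotation described above can be sketched in plain NumPy (the least-squares model and toy data are illustrative stand-ins for whatever model you are evaluating):

```python
import numpy as np

def kfold_indices(n, k, seed=0):
    idx = np.random.default_rng(seed).permutation(n)
    return np.array_split(idx, k)

def cross_val_mse(X, y, k=5):
    folds = kfold_indices(len(y), k)
    scores = []
    for i in range(k):
        val = folds[i]                                         # hold out fold i
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        w, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)  # toy model: OLS
        scores.append(np.mean((X[val] @ w - y[val]) ** 2))
    return float(np.mean(scores))

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, 2.0, 3.0]) + 0.1 * rng.normal(size=100)
score = cross_val_mse(X, y)
# score averages validation MSE over all 5 rotations
```

In practice `sklearn.model_selection.cross_val_score` does this (plus stratification and shuffling options) for you.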
4. How do you handle imbalanced datasets?
Answer: Options include: oversampling the minority class (SMOTE), undersampling the majority class, using class weights in the loss function, ensemble methods (balanced random forests), anomaly detection approaches, and choosing appropriate metrics (precision-recall rather than accuracy). The best approach depends on the specific problem and data size.
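The class-weight option is the cheapest to sketch; this mirrors the "balanced" heuristic (weights inversely proportional to class frequency) found in libraries like scikit-learn:

```python
import numpy as np

def class_weights(y):
    """Weight each class inversely to its frequency: n / (n_classes * count)."""
    classes, counts = np.unique(y, return_counts=True)
    w = len(y) / (len(classes) * counts)
    return dict(zip(classes.tolist(), w.tolist()))

y = np.array([0] * 90 + [1] * 10)     # 9:1 imbalanced labels
cw = class_weights(y)
# the minority class receives 9x the weight of the majority class
```

These weights then multiply each sample's contribution to the loss, so minority-class errors cost proportionally more.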
5. What is the curse of dimensionality?
Answer: As the number of features increases, the data becomes increasingly sparse in the feature space. Distance metrics become less meaningful, models need exponentially more data, and overfitting risk increases. Solutions include feature selection, dimensionality reduction (PCA for modeling; t-SNE/UMAP mainly for visualization), and regularization.
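The "distances become less meaningful" claim can be demonstrated directly: as dimension grows, the relative gap between the nearest and farthest point from a query shrinks (the point counts and dimensions here are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
contrasts = {}
for d in (2, 100, 10_000):
    X = rng.uniform(size=(500, d))                 # 500 random points
    q = rng.uniform(size=d)                        # a random query point
    dist = np.linalg.norm(X - q, axis=1)
    contrasts[d] = (dist.max() - dist.min()) / dist.min()
    print(f"d={d:>6}: relative contrast = {contrasts[d]:.3f}")
# contrast shrinks as d grows: "nearest" and "farthest" become nearly the same
```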
Deep Learning
6. Explain backpropagation.
Answer: Backpropagation computes gradients of the loss function with respect to each weight using the chain rule. Starting from the output layer, it propagates error backward through the network. Each weight's gradient indicates the direction and magnitude to adjust it to reduce loss. Combined with gradient descent, it updates weights iteratively during training.
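The chain-rule bookkeeping can be shown on a tiny two-layer network, with a finite-difference check confirming the analytic gradient (shapes and the tanh activation are illustrative choices; bias gradients omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))                 # batch of 4 samples, 3 features
y = rng.normal(size=(4, 1))
W1, b1 = rng.normal(size=(3, 5)), np.zeros(5)
W2, b2 = rng.normal(size=(5, 1)), np.zeros(1)

# forward pass
z1 = x @ W1 + b1
a1 = np.tanh(z1)
yhat = a1 @ W2 + b2
loss = np.mean((yhat - y) ** 2)

# backward pass: apply the chain rule layer by layer
dyhat = 2 * (yhat - y) / y.size             # dL/dyhat for MSE
dW2 = a1.T @ dyhat                          # dL/dW2
da1 = dyhat @ W2.T                          # propagate error into layer 1
dz1 = da1 * (1 - a1 ** 2)                   # tanh'(z) = 1 - tanh(z)^2
dW1 = x.T @ dz1                             # dL/dW1

# numerical check on one weight: nudge it and measure the loss change
eps = 1e-6
W1[0, 0] += eps
num = (np.mean((np.tanh(x @ W1 + b1) @ W2 + b2 - y) ** 2) - loss) / eps
W1[0, 0] -= eps
# num should agree with the analytic dW1[0, 0]
```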
7. What is the vanishing gradient problem?
Answer: In deep networks with sigmoid/tanh activations, gradients become exponentially smaller as they propagate backward through layers. Early layers barely update their weights, so deep networks train slowly or not at all; in RNNs this prevents learning long-range dependencies. Solutions: ReLU activation (gradient is either 0 or 1), residual (skip) connections, batch normalization, LSTM/GRU cells for RNNs.
8. Compare batch normalization and layer normalization.
Answer: Batch norm normalizes across the batch dimension for each feature — effective for CNNs but depends on batch size. Layer norm normalizes across features for each sample — independent of batch size, preferred in transformers and RNNs. Batch norm has running statistics for inference; layer norm computes statistics per-sample at all times.
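The only real difference at the math level is the normalization axis, which a short NumPy sketch makes concrete (learnable scale/shift parameters omitted):

```python
import numpy as np

x = np.random.default_rng(0).normal(loc=3.0, scale=2.0, size=(8, 16))  # (batch, features)

def batch_norm(x, eps=1e-5):
    """Normalize each feature across the batch (axis=0)."""
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

def layer_norm(x, eps=1e-5):
    """Normalize each sample across its own features (axis=-1)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

bn, ln = batch_norm(x), layer_norm(x)
# bn: every column (feature) now has mean ~0; ln: every row (sample) has mean ~0
```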
9. What is dropout and how does it work?
Answer: Dropout randomly sets neuron outputs to zero during training with probability p (typically 0.1-0.5). This forces the network to learn redundant representations, preventing co-adaptation of neurons. At inference, all neurons are active; classic dropout scales outputs by (1-p), while modern frameworks use inverted dropout, scaling by 1/(1-p) during training so inference needs no rescaling. It acts as an implicit ensemble of sub-networks.
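A minimal sketch of inverted dropout, the variant modern frameworks implement, which rescales at training time so inference needs no change:

```python
import numpy as np

def dropout(a, p, training, rng):
    """Inverted dropout: zero each unit with probability p, scale survivors by 1/(1-p)."""
    if not training:
        return a                              # inference: identity, no rescaling needed
    mask = rng.random(a.shape) >= p           # keep each unit with probability 1-p
    return a * mask / (1.0 - p)

rng = np.random.default_rng(0)
a = np.ones((1000, 100))
out = dropout(a, p=0.5, training=True, rng=rng)
# roughly half the units are zeroed, but the expected activation (mean ~1.0) is preserved
```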
10. Explain the attention mechanism.
Answer: Attention computes a weighted sum of values based on the compatibility between a query and keys. For each query token, it calculates dot products with all keys, scales by 1/√d_k (where d_k is the key dimension), applies softmax to get attention weights, then sums values weighted by these scores. Self-attention uses the same sequence for queries, keys, and values. Multi-head attention runs this in parallel with different learned projections.
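The full computation fits in a few lines of NumPy (single head, no mask, toy shapes):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # query-key compatibility
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ V, weights                      # weighted sum of values

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))                          # 5 tokens, dimension 8
out, w = scaled_dot_product_attention(x, x, x)       # self-attention: Q = K = V = x
# each row of w is a distribution over which tokens to attend to
```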
NLP & Transformers
11. How does the transformer architecture work?
Answer: Transformers use stacked layers of multi-head self-attention and feed-forward networks with residual connections and layer normalization. Positional encodings inject sequence order. Encoder-only (BERT) for understanding, decoder-only (GPT) for generation, encoder-decoder (T5) for sequence-to-sequence tasks. Key advantage over RNNs: full parallelization and direct long-range dependencies.
12. What is the difference between BERT and GPT?
Answer: BERT uses bidirectional self-attention (sees all tokens) and is pre-trained with masked language modeling — predicting masked words from context. GPT uses causal (left-to-right) self-attention and is pre-trained with next-token prediction. BERT excels at understanding tasks (classification, NER). GPT excels at generation tasks.
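The architectural difference reduces to one mask on the attention scores; a sketch assuming the raw scores are already computed (zeros here, for clarity):

```python
import numpy as np

T = 4
causal = np.triu(np.ones((T, T), dtype=bool), k=1)   # True above the diagonal
scores = np.zeros((T, T))
scores[causal] = -np.inf     # GPT-style: token t cannot attend to tokens t+1, t+2, ...
# BERT-style attention applies no such mask, so every token sees the full sequence
e = np.exp(scores)           # exp(-inf) = 0, killing the masked positions
weights = e / e.sum(axis=-1, keepdims=True)
# weights is lower-triangular: row t spreads attention only over tokens 0..t
```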
13. What is tokenization and why does it matter?
Answer: Tokenization splits text into units (tokens) the model processes. Methods: word-level (simple but large vocabulary), character-level (small vocabulary but loses meaning), subword (BPE/SentencePiece — best balance). BPE splits rare words into common subwords. Token count affects context window size, processing speed, and model capability.
14. Explain fine-tuning vs. few-shot learning vs. RAG.
Answer: Fine-tuning updates model weights on a specific dataset — highest quality but expensive. Few-shot learning provides examples in the prompt without modifying weights — quick but limited. RAG retrieves relevant documents at inference time and includes them in the context — keeps knowledge current without retraining. Use fine-tuning for domain adaptation, few-shot for quick prototyping, RAG for factual accuracy with changing data.
15. What is a vector embedding?
Answer: A dense numerical representation of data (text, images, etc.) in a continuous vector space where similar items are close together. Word2Vec maps words to vectors where "king - man + woman ≈ queen." Sentence embeddings capture semantic meaning. Used in search, recommendation, clustering, and as inputs to downstream models.
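"Similar items are close together" is usually measured with cosine similarity; a toy 3-dimensional sketch (real embeddings have hundreds of dimensions, and these vectors are made up for illustration):

```python
import numpy as np

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# hypothetical toy "embeddings"
cat = np.array([0.9, 0.8, 0.1])
dog = np.array([0.8, 0.9, 0.2])
car = np.array([0.1, 0.2, 0.9])
cat_dog = cosine_similarity(cat, dog)   # close to 1: semantically similar
cat_car = cosine_similarity(cat, car)   # smaller: dissimilar
```

Vector databases rank candidates by exactly this kind of similarity (or its approximate-nearest-neighbor equivalents).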
System Design
16. Design a recommendation system.
Answer: Approaches: collaborative filtering (user-item interactions), content-based (item features), hybrid. Modern systems: train embedding models on user-item interactions, store embeddings in a vector database (Pinecone, Qdrant), use approximate nearest neighbor search for real-time recommendations. Key considerations: cold start problem, implicit vs explicit feedback, A/B testing, serving latency.
17. How would you deploy an ML model to production?
Answer: Steps: containerize the model (Docker), create a REST API (FastAPI/Flask), implement model versioning, set up CI/CD pipeline, deploy to cloud (AWS SageMaker, GCP Vertex AI, or Kubernetes), implement monitoring (data drift, prediction drift, latency), create a rollback strategy, set up A/B testing infrastructure.
18. Design a RAG system for a customer support chatbot.
Answer: Components: document ingestion pipeline (chunk documents, generate embeddings via embedding model), vector store (Pinecone/Chroma), retrieval layer (semantic search on user query), LLM for generation (Claude/GPT with retrieved context in prompt), guardrails for safety. Key decisions: chunk size, embedding model, number of retrieved documents, prompt template, caching strategy.
Coding & Implementation
19. Implement gradient descent from scratch.
Answer: Initialize weights randomly. For each iteration: compute predictions, calculate loss (MSE), compute gradients (dL/dw), update weights: w = w - learning_rate * gradient. Repeat until convergence. Key concepts: learning rate selection, batch vs mini-batch vs stochastic GD, momentum, Adam optimizer.
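The steps above, as a minimal NumPy implementation for linear regression (full-batch; the toy data is noise-free so convergence is exact):

```python
import numpy as np

def gradient_descent(X, y, lr=0.1, epochs=500):
    """Full-batch gradient descent minimizing MSE for a linear model."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        y_pred = X @ w + b
        error = y_pred - y
        grad_w = 2 * X.T @ error / n      # dL/dw
        grad_b = 2 * error.mean()         # dL/db
        w -= lr * grad_w                  # step against the gradient
        b -= lr * grad_b
    return w, b

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X @ np.array([2.0, -3.0]) + 1.0
w, b = gradient_descent(X, y)
# recovers w ~ [2, -3] and b ~ 1
```

Mini-batch/stochastic variants only change which rows of X enter each gradient computation; momentum and Adam change how the step is formed from the gradient.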
20. Implement a basic neural network forward pass.
Answer: For each layer: z = x @ W + b (linear transformation), a = activation(z). Output layer uses softmax for classification or linear for regression. Shape management is critical: input (batch_size, features), weights (features, hidden_dim), output (batch_size, hidden_dim).
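Those shapes, in a runnable sketch of a two-layer classifier forward pass (layer sizes are arbitrary illustrative choices):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def forward(x, params):
    """x: (batch, features). Each layer: z = x @ W + b, then activation."""
    W1, b1, W2, b2 = params
    h = relu(x @ W1 + b1)         # (batch, hidden)
    return softmax(h @ W2 + b2)   # (batch, classes)

rng = np.random.default_rng(0)
params = (rng.normal(size=(4, 8)), np.zeros(8),    # 4 features -> 8 hidden
          rng.normal(size=(8, 3)), np.zeros(3))    # 8 hidden  -> 3 classes
probs = forward(rng.normal(size=(2, 4)), params)
# probs has shape (2, 3) and each row sums to 1
```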
Additional Common Questions
21-25: Quick-fire answers
- 21. Precision vs Recall? Precision = TP/(TP+FP), Recall = TP/(TP+FN). Use precision when false positives are costly (spam detection). Use recall when false negatives are costly (disease screening).
- 22. Explain gradient clipping. Caps gradients to a maximum norm during backpropagation. Prevents exploding gradients in deep networks and RNNs. Typical max norm: 1.0 to 5.0.
- 23. What is transfer learning? Using a model pre-trained on a large dataset as a starting point for a new task. Fine-tune the last layers on your specific data. Dramatically reduces data and compute requirements.
- 24. Explain data augmentation. Creating modified versions of training data to increase diversity. Images: rotation, flip, crop, color jitter. Text: synonym replacement, back-translation, random insertion. Reduces overfitting and improves generalization.
- 25. What is model distillation? Training a smaller "student" model to mimic a larger "teacher" model. The student learns from the teacher's soft probability outputs, not just hard labels. Used to deploy efficient models on edge devices.
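Of the quick-fire items, gradient clipping (item 22) is the most mechanical; a global-norm sketch in the spirit of PyTorch's `clip_grad_norm_` (the gradient values here are fabricated to simulate explosion):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=1.0):
    """If the combined L2 norm of all gradients exceeds max_norm, rescale them all."""
    total = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    if total > max_norm:
        grads = [g * (max_norm / total) for g in grads]
    return grads

grads = [np.full(3, 10.0), np.full(2, -10.0)]      # "exploding" gradients, norm ~22.4
clipped = clip_by_global_norm(grads, max_norm=1.0)
# combined norm of clipped is now exactly max_norm; directions are unchanged
```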
26-30: Behavioral & Practical
- 26. Describe an ML project you led. Use the STAR format: Situation, Task, Action, Result. Emphasize impact metrics and technical decisions.
- 27. How do you decide which model to use? Start simple (logistic regression, random forest), evaluate baseline, iterate with more complex models if needed. Consider interpretability requirements, data size, latency constraints, and maintenance burden.
- 28. How do you handle model drift? Monitor input distributions and prediction outputs. Set alerts for statistical drift (KL divergence, PSI). Retrain on fresh data periodically. A/B test new vs old models before full rollout.
- 29. What's your approach to debugging a model that's not learning? Check data quality first, verify labels, reduce to minimal example, check learning rate, monitor gradients (vanishing/exploding), verify loss function, try overfitting on small batch.
- 30. How do you stay current in AI? Follow arxiv papers via Papers With Code, attend conferences (NeurIPS, ICML), follow researchers on Twitter/X, join Hugging Face community, experiment with new models and tools.
Master these concepts in depth with our interactive lessons. Start with Introduction to AI, then dive into Machine Learning Fundamentals, Neural Networks, and Transformer Architecture. Get full access to all 31 lessons and ace your AI interview.