Imagine having a brilliant, tireless data science assistant at your fingertips – one that can write code in multiple languages, explain complex algorithms, brainstorm feature ideas, draft documentation, and even help debug your SQL queries, all at lightning speed. This isn’t a far-off dream; it’s the reality of working with Large Language Models (LLMs) like ChatGPT.
But like any powerful tool, its effectiveness hinges on one crucial factor: your instructions. Generic requests to ChatGPT often yield generic (and sometimes unhelpful) results. To truly harness its power for the multifaceted world of data science, you need to speak its language with precision and clarity.
We’ve curated an extensive collection of targeted, actionable prompts specifically designed to supercharge your data science workflow across every stage – from initial code generation and intricate data preprocessing to advanced model tuning and the clear communication of your findings. Whether you’re a seasoned data scientist looking to optimize your processes, an aspiring analyst building your toolkit, or an ML engineer seeking to accelerate development, these prompts are your launchpad.
Learn how to ask the right questions to get the most insightful answers, generate robust code, understand complex concepts, and ultimately, become more efficient and effective in your data science endeavors.
How to Use These Prompts:
Replace the bracketed placeholders [like this] with your specific details, data, problem, or context. The more context you provide, the better ChatGPT’s response will be. If ChatGPT’s initial code or explanation has an issue, don’t hesitate to point it out clearly for a correction, e.g., “Your previous code response had an issue: [clearly point out what is wrong, e.g., 'it used a deprecated function', 'the logic for handling X was incorrect']. Can you try again, addressing this specific issue?”
Dive in, customize these prompts, and transform ChatGPT into an indispensable partner on your data journey!
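If you reuse the same prompts often, filling in the placeholders can be scripted. The snippet below is a minimal sketch, not part of any official tooling: the helper name fill_prompt and the single-word placeholder key [task] are assumptions made for illustration, and the template is a shortened version of the General Python Function prompt from Section I.

```python
import re


def fill_prompt(template, values):
    """Replace each [placeholder] in a prompt template with its value.

    Raises KeyError when a placeholder has no value, so a half-filled
    prompt is never sent by accident.
    """
    def substitute(match):
        key = match.group(1)
        if key not in values:
            raise KeyError(f"no value supplied for placeholder [{key}]")
        return values[key]

    # Match any run of non-bracket characters enclosed in [ and ].
    return re.sub(r"\[([^\[\]]+)\]", substitute, template)


# Hypothetical template with a short placeholder key.
template = (
    "Act as a Python code generator. Create a Python function to perform "
    "the following task: [task]. Ensure the code is well-commented."
)
prompt = fill_prompt(
    template,
    {"task": "read a CSV file into a pandas DataFrame and return the first 5 rows"},
)
print(prompt)
```

Failing loudly on a missing value is a deliberate choice here: a prompt that still contains a literal [placeholder] tends to produce exactly the generic answer you are trying to avoid.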
I. Code Generation & Manipulation
This section focuses on generating code for various languages and tools commonly used in data science.
- General Python Function:
Act as a Python code generator. Create a Python function to perform the following task: [clearly describe the task, e.g., 'read a CSV file into a pandas DataFrame and return the first 5 rows', 'calculate the moving average of a list of numbers with a window size N']. Ensure the code is well-commented.
- Python Script:
Act as a Python script writer. Write a Python script that will [describe the script's overall goal, e.g., 'scrape product names and prices from {website url}', 'automate the renaming of files in {directory path} based on {naming convention}']. The script should handle potential errors like [e.g., file not found, network issues].
- Python Module Development:
Act as a Python developer. Write a Python module to calculate [specific metric or functionality, e.g., 'various text similarity scores like Jaccard and Cosine', 'custom evaluation metrics for my model'] using data from [dataset description or placeholder, e.g., '{your dataframe variable}', '{input text list}']. The module should include functions for [list key functionalities, e.g., 'calculating Jaccard similarity', 'calculating Cosine similarity', 'preprocessing input text'].
- NumPy Array Creation:
Act as a data scientist. Create a NumPy array with the shape ({X dim}, {Y dim}, {Z dim}) initialized with [e.g., 'random integers between {min val} and {max val}', 'zeros', 'values drawn from a normal distribution with mean {mean} and std dev {std}'].
- Regex Generation:
Act as a regex writer. Generate a Python regular expression to [describe the pattern to match, e.g., 'extract all URLs', 'find phone numbers in US format', 'validate an email address'] from the following sample text: '{sample text or description of text}'.
- Multithreaded Python Function:
Act as a Python developer. Convert the following Python function to a multithreaded version to perform [task description, e.g., 'API calls to multiple endpoints', 'image processing on a batch of files'] on [input description, e.g., 'a list of URLs', 'a directory of images'] using [{N}] threads. Original function: [paste your Python function here]
- SQL Code Generation:
Act as a SQL code generator for [SQL Dialect, e.g., 'PostgreSQL 14', 'MySQL 8', 'SQL Server 2019', 'SQLite']. I have a table named {table name} with columns: [{column name 1 (data type)}, {column name 2 (data type)}, ...]. Write a query to [describe the desired SQL operation, e.g., 'select the top 5 customers by total sales amount', 'calculate the 7-day rolling average of {value column} partitioned by {category column}', 'find all products that have not been sold in the last {N months}'].
- Google Sheets Formula:
Act as a Google Sheets formula generator. Create a formula that [describe the desired spreadsheet operation, e.g., 'calculates the sum of cells A1:A50 if the corresponding cell in B1:B50 is "Complete"', 'finds the VLOOKUP for {search key} in range {range} returning the {column index}'].
- Excel VBA Macro:
Act as an Excel VBA developer. Write a VBA macro that [describe the function of the macro, e.g., 'loops through all selected cells and changes their background color to yellow', 'prompts the user for a folder path and imports all CSV files from that folder into separate new worksheets'].
- R Scripting:
Act as a data scientist using R. Write an R script to [describe the specific requirement, e.g., 'load the {dataset name}.csv file and perform a linear regression of {dependent variable} on {independent variable1} and {independent variable2}', 'create a bar chart using ggplot2 showing the average {value column} for each {category column} in the dataframe {df name}'].
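To give a sense of what a good answer to the General Python Function prompt can look like, here is a hand-written sketch for the moving-average example task. It is one reasonable implementation, not ChatGPT output; the function name moving_average and the choice to return an empty list when the window is longer than the data are assumptions.

```python
def moving_average(values, window):
    """Return the simple moving average of `values` for a window of size `window`."""
    if window <= 0:
        raise ValueError("window must be a positive integer")
    if window > len(values):
        # Assumption: no full window fits, so there is nothing to average.
        return []

    # Keep a running sum: add the newest value and drop the oldest one,
    # so each step costs O(1) instead of re-summing the whole window.
    running = sum(values[:window])
    averages = [running / window]
    for i in range(window, len(values)):
        running += values[i] - values[i - window]
        averages.append(running / window)
    return averages


print(moving_average([1, 2, 3, 4, 5], 3))  # [2.0, 3.0, 4.0]
```

When you review generated code, check the same things this sketch handles explicitly: invalid arguments, inputs shorter than the window, and whether the comments match what the code does.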
II. Data Understanding & Exploratory Data Analysis (EDA)
This section provides prompts for understanding your data and performing initial investigations.
- Comprehensive EDA Plan:
Act as a data scientist. I have a dataset described as: [describe your dataset, e.g., 'a pandas DataFrame named {df name} with columns {column list} related to employee attrition']. Provide Python code (using pandas, matplotlib, seaborn) to perform comprehensive exploratory data analysis. Include steps for: [e.g., 'displaying .info() and .describe()', 'checking for missing values and visualizing them', 'distribution plots for key numerical columns like {numerical col 1} and {numerical col 2}', 'count plots for key categorical columns like {categorical col 1}', 'a correlation heatmap for numerical features', 'pair plots for a subset of important features'].
- Dataset Overview Questions:
I have a dataset about [domain/topic, e.g., online retail transactions, sensor readings from industrial machines] with the following columns: [Column A (type), Column B (type), ...]. Can you suggest 5 key exploratory data analysis (EDA) questions I should investigate to understand the data’s structure, quality, and potential insights?
- Statistical Summary Plan:
For my dataset concerning [problem description, e.g., predicting customer lifetime value], with numerical features [Feature1, Feature2, ...] and categorical features [CategoryA, CategoryB, ...], provide a plan to generate a comprehensive statistical summary. What key metrics should I focus on for each type of feature and why?
- Identifying Relationships:
In my dataset for [project goal, e.g., understanding factors affecting student performance], how can I best explore the relationship between [Variable X, e.g., 'study hours'] and [Variable Y (target or related variable), e.g., 'exam score']? Suggest statistical tests (e.g., Pearson correlation, chi-squared) and visualization types (e.g., scatter plot, box plot).
- Outlier Detection Methods:
For a feature named [feature name] (which is [numerical/categorical]) in my dataset on [dataset topic, e.g., credit card transactions], what are 3 different methods to identify potential outliers (e.g., Z-score, IQR, Isolation Forest)? Provide Python code snippets using [pandas/numpy/scipy/sklearn] for one of these methods.
- Understanding Distributions & Transformations:
How can I check the distribution of the numerical feature [feature name] in my dataset using Python (seaborn or matplotlib)? If it’s skewed, what are 2-3 common transformation techniques (e.g., log, Box-Cox) to consider for a [type of model, e.g., linear regression, neural network] model, and why might they be helpful?
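As a reference point for the Outlier Detection Methods prompt, the IQR rule it mentions can be sketched with the standard library alone. The function name iqr_outliers, the sample amounts, and the conventional 1.5 multiplier are assumptions for illustration; a pandas or NumPy answer would compute the same bounds, though quartile conventions differ slightly between libraries.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # method="inclusive" treats the data as the full population and
    # interpolates between the observed points.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical transaction amounts with one obviously unusual value.
amounts = [10, 12, 11, 13, 12, 11, 14, 100]
print(iqr_outliers(amounts))
```

Knowing the rule yourself makes it easier to spot when a generated snippet silently uses a different multiplier or quartile method than the one you asked for.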
III. Data Cleaning & Preprocessing
Prompts for cleaning raw data and preparing it for analysis or model training.
- Targeted Data Preprocessing Steps:
Act as a data analyst. Preprocess the dataset loaded into a pandas DataFrame named {df name} (or located at [{file path}]) by performing the following specific steps in order: [e.g., '1. Remove duplicate records based on all columns', '2. Handle missing values in {column name A} by filling with the mean', '3. Convert {column name B} from object to datetime', '4. Standardize {column name C} using StandardScaler']. Provide Python code.
- Handling Missing Values Strategies:
My dataset for [problem domain, e.g., patient health records] has missing values in columns [Column X (e.g., 30% missing, numeric, e.g., 'blood pressure'), Column Y (e.g., 5% missing, categorical, e.g., 'smoker status')]. What are 3 different strategies (e.g., mean/median/mode imputation, KNN imputation, predictive model imputation) to handle these missing values? Detail their pros, cons, and when each might be most appropriate for my target model [model type, e.g., Random Forest].
- Categorical Variable Encoding Comparison:
I have categorical features [Feature A (e.g., 'City' - high cardinality), Feature B (e.g., 'Payment Method' - medium cardinality), Feature C (e.g., 'Binary Flag' - low cardinality)] for my [type of model, e.g., Gradient Boosting] model. Compare one-hot encoding, label encoding, ordinal encoding, and target encoding for these features. Which would you recommend for each and why? Provide Python sklearn or category_encoders examples.
- Text Data Preprocessing Pipeline:
I’m working with text data from [source, e.g., customer support tickets] for [task, e.g., text classification]. Outline a comprehensive text preprocessing pipeline. What are the essential steps (e.g., lowercasing, punctuation removal, tokenization, stop-word removal, stemming/lemmatization, TF-IDF/Word2Vec vectorization)? Provide a Python example using [NLTK/spaCy/sklearn] for key steps.
- Data Validation Rules:
Act as a data quality analyst. Write Python code using pandas (or a library like Great Expectations or Pandera if you can provide a conceptual example) to validate the column named {column name} in the DataFrame {dataframe name}. The validation checks should ensure: [list requirements, e.g., 'the column contains only {valid data type, e.g., integers}', 'values are within the range {min val} to {max val}', 'there are no missing values', 'all entries match one of the predefined categories: {category list}', 'it does not contain outliers based on the IQR method'].
IV. Feature Engineering & Selection
Guidance on creating new features from existing data and selecting the most relevant ones.
- Creating New Features:
Given features [Feature 1 (e.g., 'transaction date'), Feature 2 (e.g., 'user registration date'), Feature 3 (e.g., 'purchase amount')] for predicting [target variable, e.g., 'repeat purchase propensity'], suggest 3 potential new features I could engineer. Explain the rationale and provide Python pandas implementation for each (e.g., ‘days since registration’, ‘average purchase frequency’).
- Feature Scaling Comparison:
Explain the difference between Standardization (Z-score normalization) and Min-Max scaling. For features [Feature A (e.g., 'age' with range 20-70), Feature B (e.g., 'income' with range 30k-500k)] in my dataset, when would I choose one over the other, especially if I plan to use [model type, e.g., K-Means clustering, SVM, Neural Network]? Provide Python sklearn examples.
- Determining Feature Importance (Post-Model Training):
After training a [Model Name, e.g., LightGBM Classifier, Linear Regression] on my dataset with features [list of features] to predict [target variable], how can I determine and visualize the most important features? Provide a Python sklearn (or library-specific) example for extracting and plotting feature importances.
- Synthetic Data Generation for Augmentation/Testing:
Act as a synthetic data generator. Create a pandas DataFrame with [{X}] rows and the following columns and characteristics: [{column name 1}: {description, e.g., 'integers uniformly distributed between 1 and 100'}, {column name 2}: {description, e.g., 'categorical with values "A", "B", "C" with probabilities {0.5, 0.3, 0.2}'}, {column name 3}: {description, e.g., 'normally distributed float with mean {mean val} and standard deviation {std dev val}'}, {column name 4}: {description, e.g., 'dates ranging from {start date} to {end date}'}]. This data is for [purpose, e.g., testing a data pipeline, augmenting a small dataset].
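For the Feature Scaling Comparison prompt, it helps to know what the two transformations compute before reading the model's explanation. The sketch below uses only the standard library; the function names and the sample ages are assumptions. It uses the population standard deviation, which is also what sklearn's StandardScaler uses.

```python
from statistics import mean, pstdev


def standardize(values):
    """Z-score scaling: subtract the mean, divide by the standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    # Note: a constant column has sigma == 0 and needs special handling.
    return [(v - mu) / sigma for v in values]


def min_max_scale(values):
    """Min-Max scaling: map the smallest value to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    # Note: a constant column has hi == lo and needs special handling.
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical 'age' feature.
ages = [20, 30, 40, 50, 70]
print(standardize(ages))
print(min_max_scale(ages))
```

Standardized values are centered on zero with unit spread but are not bounded, while Min-Max values always fall in [0, 1] and are more sensitive to a single extreme value. That contrast is what the prompt asks ChatGPT to reason about for your chosen model.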
V. Machine Learning – Model Building & Training
Prompts related to selecting, building, and training various machine learning models.
- Algorithm Suggestion for Task:
I am working on a [classification/regression/clustering/anomaly detection/NLP/etc.] problem to [predict/analyze/goal, e.g., 'predict house prices', 'segment customers', 'detect fraudulent transactions'] based on [brief description of data: e.g., '100 numerical and 10 categorical features, 50k samples', 'a corpus of news articles', 'time series data of website traffic']. Suggest 3 suitable machine learning algorithms and briefly explain why each might be a good fit, considering data characteristics and problem type.
- Comparing Specific Algorithms:
Compare and contrast [Algorithm A, e.g., K-Means Clustering] and [Algorithm B, e.g., DBSCAN] for a [problem type, e.g., customer segmentation] task on [type of data, e.g., data with non-spherical clusters and varying densities]. Discuss their assumptions, pros, cons, key parameters, and interpretability.
- Choosing a Baseline Model:
For my [problem type, e.g., multi-class classification] task of predicting [target variable, e.g., 'product category'], what would be a simple yet effective baseline model (e.g., DummyClassifier, Logistic Regression) to establish initial performance? Explain why and provide a Python sklearn example.
- Classification Model Training:
Act as a data scientist. I have a dataset [describe dataset, e.g., 'in a pandas DataFrame {df name} with features {feature list} and a binary target variable {target column name}']. Build and train a [specific model, e.g., Logistic Regression, Random Forest, Support Vector Machine] model to predict [{target column name}]. Include steps for splitting data, training the model, and making predictions. Provide Python sklearn code.
- Clustering Implementation:
Act as a data scientist. Cluster the [items to cluster, e.g., 'customers based on demographics and purchase behavior', 'documents based on their TF-IDF vectors'] in the dataset [{dataset description or DataFrame name}] into an appropriate number of groups (or suggest how to find k). Use the [clustering algorithm, e.g., 'KMeans', 'AgglomerativeClustering'] algorithm. Provide Python code for its implementation and, if applicable, suggest how to visualize the clusters (e.g., using PCA for dimensionality reduction first).
- Automated Machine Learning (AutoML):
Act as an AutoML bot using [AutoML library, e.g., 'TPOT', 'Auto-sklearn', 'FLAML']. I am working on a model to predict [{target variable}] (a [classification/regression] task) based on the features in [{dataset description or path to csv}]. Please write Python code to find the best model, optimizing for [{metric, e.g., 'ROC AUC', 'F1-score for class X', 'Mean Squared Error'}] on the test set. Include code for basic data preprocessing if the library handles it or suggest what I should do first.
- Natural Language Processing (NLP) Task:
Act as an NLP specialist. For the text data in [dataset description or column name, e.g., 'the "review text" column of my DataFrame {df name}'], perform [NLP task, e.g., 'sentiment analysis to classify feedback as positive/negative/neutral', 'topic modeling using LDA to identify {k topics} key themes', 'named entity recognition to extract organizations and locations']. Explain the chosen approach/model (e.g., VADER, sklearn.decomposition.LatentDirichletAllocation, spaCy NER) and provide Python code.
- Recommender System Development:
Act as a data scientist. Develop a [type of recommender system, e.g., 'item-based collaborative filtering', 'content-based using TF-IDF of product descriptions', 'matrix factorization using SVD'] recommender system. The system should suggest [items to recommend, e.g., 'movies', 'products'] to [users or entities, e.g., 'users based on their ratings', 'customers based on their purchase history']. The data is in [data source description, e.g., 'a DataFrame {df name} with columns user id, item id, rating']. Describe the methodology and provide Python code using a library like [Surprise/implicit/sklearn].
- Time Series Analysis & Forecasting:
Act as a time series expert. I have a time series dataset: [describe dataset, e.g., 'daily stock prices for {ticker symbol} in {df name} with columns {date column} and {price column}']. Build a model (e.g., [ARIMA, SARIMA, Prophet, Exponential Smoothing, LSTM]) to forecast [{target variable, e.g., 'price column'}] for the next [{N time periods, e.g., '30 days'}]. Use data from [{train start date} to {train end date}] for training. Provide Python code using [statsmodels/pmdarima/prophet/tensorflow].
VI. Machine Learning – Model Tuning & Advanced Techniques
Prompts for refining models and applying more advanced machine learning techniques.
- Cross-Validation Strategy & Implementation:
Explain k-fold cross-validation and its importance for model evaluation. For my dataset of [N] samples for a [classification/regression] task, what value of ‘k’ would you recommend and why? Provide a Python sklearn example of performing k-fold cross-validation for a [model name, e.g., GradientBoostingClassifier] and calculating [metric, e.g., 'average ROC AUC'].
- Hyperparameter Tuning Code:
Act as a data scientist. I have trained a [{model name, e.g., 'XGBoost Classifier'}] model. Write Python code to tune its hyperparameters using [tuning technique, e.g., 'GridSearchCV', 'RandomizedSearchCV', 'Optuna for Bayesian optimization']. The hyperparameters to tune are [{hyperparameter 1: [range or list], hyperparameter 2: [range or list], ...}] to optimize for [{metric, e.g., 'accuracy', 'log loss'}] on the dataset [{dataset description or DataFrame name}]. Show how to access the best parameters and score.
- Addressing Overfitting/Underfitting:
My [model name, e.g., 'Decision Tree'] model, trained on [dataset description], shows [symptoms of overfitting/underfitting, e.g., 'training accuracy of 99% but validation accuracy of 70%' OR 'training and validation accuracy both low around 50%']. What are 3 common strategies to mitigate this specific issue (e.g., regularization, more data, feature selection, early stopping, adjusting model complexity)?
- Handling Imbalanced Data (Specific Technique):
Act as a data scientist. I have an imbalanced dataset where the target variable is [{target column name}] in [dataset description or DataFrame name]. The minority class constitutes [X]% of the data. Provide Python code to address this imbalance using the [specific technique, e.g., 'SMOTE (Synthetic Minority Over-sampling Technique) from imbalanced-learn', 'RandomUnderSampler from imbalanced-learn', 'class weighting in the model {model name}']. Show how to apply it within a sklearn pipeline if possible.
- Anomaly Detection Implementation:
Act as a data scientist. Detect anomalies in [type of data, e.g., 'server CPU usage metrics', 'credit card transaction amounts'] from [data source description, e.g., 'the pandas DataFrame {df name} column {column to analyze}', 'a NumPy array of sensor readings'] using [machine learning algorithm or technique, e.g., 'Isolation Forest', 'Local Outlier Factor (LOF)', 'One-Class SVM']. Describe the anomalies you are looking for: [e.g., 'sudden spikes or drops in usage', 'transactions significantly larger or smaller than typical patterns']. Provide Python sklearn code.
- Dimensionality Reduction (Specific Technique):
Act as a data scientist. Reduce the dimensionality of the [data description, e.g., 'high-dimensional feature set from image embeddings', 'survey response data with many correlated questions'] in [{dataset description or DataFrame name}] using [dimensionality reduction technique, e.g., 'Principal Component Analysis (PCA) to retain 95% variance', 't-SNE for 2D visualization', 'UMAP for {n components} components']. Explain the key steps/parameters for the chosen technique and provide Python sklearn (or umap-learn) code.
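The Cross-Validation Strategy prompt in this section asks ChatGPT to explain k-fold cross-validation, so it is worth having the mechanics in mind when you judge the answer. The sketch below shows only the index-splitting step, using the standard library; the function name k_fold_indices and the seed parameter are assumptions, and in practice sklearn's KFold does this for you.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, validation_indices) pairs for k-fold CV."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")

    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once, reproducibly

    # Deal the shuffled indices into k folds; sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]

    for i in range(k):
        validation = folds[i]
        # Every other fold goes into the training set for this round.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


for train, validation in k_fold_indices(n_samples=10, k=5):
    print(sorted(validation), len(train))
```

Each sample is used for validation exactly once and for training k - 1 times. A generated example that fits preprocessing on the full dataset before splitting breaks that separation and is worth pushing back on.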
VII. Machine Learning – Model Evaluation & Interpretation
How to evaluate the performance of your models and understand their predictions.
- Choosing Appropriate Evaluation Metrics:
My [classification/regression] model is predicting [target variable (e.g., 'disease presence (binary)', 'customer satisfaction score (1-5 ordinal)', 'sales amount (continuous)')]. Which evaluation metrics (e.g., Accuracy, Precision, Recall, F1-score, ROC AUC, Cohen’s Kappa for classification; RMSE, MAE, R-squared, MAPE for regression) are most appropriate for this business problem and why? Explain how to interpret [specific metric, e.g., 'Precision'] in this context.
- Interpreting Model Predictions (LIME/SHAP):
Act as a data scientist. I have trained a [{model library and name, e.g., 'scikit-learn RandomForestClassifier', 'TensorFlow Keras Sequential model'}] model on [{dataset description}]. Write Python code to explain the model’s predictions using [LIME or SHAP]. Specifically, [describe what you want to explain, e.g., 'explain the prediction for a specific instance: {instance data as dict or array}', 'show the global feature importance using SHAP summary plot', 'generate a SHAP dependence plot for {feature name}'].
- Comparing Model Performance Statistically:
I have trained [Model A (e.g., Logistic Regression)] and [Model B (e.g., Random Forest)] for predicting [target variable]. On a held-out test set, Model A achieved [Metric A value, e.g., ROC AUC of 0.75] and Model B achieved [Metric B value, e.g., ROC AUC of 0.78]. How can I statistically compare their performance (e.g., using McNemar’s test for classification accuracy, or paired t-test on cross-validation scores) to determine if Model B is significantly better for [business objective]? Provide a conceptual outline or Python example if possible.
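The Comparing Model Performance Statistically prompt mentions McNemar’s test. As a rough reference for checking a generated answer, here is a sketch of the exact (binomial) form of the test using only the standard library. The function name mcnemar_exact and the toy labels are assumptions; statsmodels provides a maintained implementation.

```python
from math import comb


def mcnemar_exact(y_true, pred_a, pred_b):
    """Exact two-sided McNemar test on paired classifier predictions.

    Returns (b, c, p_value), where b counts samples only model A got
    right and c counts samples only model B got right.
    """
    b = sum(1 for t, a, m in zip(y_true, pred_a, pred_b) if a == t and m != t)
    c = sum(1 for t, a, m in zip(y_true, pred_a, pred_b) if a != t and m == t)
    n = b + c
    if n == 0:
        return b, c, 1.0  # the models never disagree on correctness

    # Under the null hypothesis, each disagreement favors either model
    # with probability 0.5, so min(b, c) follows Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Toy data, invented for illustration only.
y_true = [1, 1, 1, 1, 1, 1, 1, 1]
pred_a = [1, 0, 0, 0, 0, 0, 1, 1]
pred_b = [1, 1, 1, 1, 1, 1, 0, 1]
print(mcnemar_exact(y_true, pred_a, pred_b))
```

Only the samples where the two models disagree carry information for this test, which is why a small gap in headline accuracy on a small test set often turns out not to be significant.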
VIII. Data Visualization
Prompts for creating effective visualizations to explore data and communicate findings.
- Choosing the Right Plot for Data:
I want to visualize [specific aspect, e.g., 'the distribution of product prices', 'the relationship between advertising spend and sales', 'the monthly trend of website visits', 'comparing sales across different product categories']. My data includes [relevant column names and types, e.g., 'price (float)', 'ad spend (float), sales (float)', 'date (datetime), visits (int)', 'category (string), sales (float)']. What type of plot (e.g., histogram, scatter plot, line chart, bar chart, box plot) would be most effective? Provide a Python code snippet using [Matplotlib/Seaborn/Plotly].
- Dashboard Visualization Ideas:
Suggest 3-5 key visualizations (and the plot types) to include in a dashboard for monitoring [key performance indicator or area, e.g., 'e-commerce sales performance', 'social media engagement', 'real-time system health'] based on data with columns [relevant columns and their general meaning].
- Visualizing Model Performance:
How can I visualize the performance of my [binary/multi-class classification model, e.g., Logistic Regression]? Suggest plots like ROC curve, Precision-Recall curve, confusion matrix, or calibration curve. Explain what insights each provides and provide Python sklearn.metrics or matplotlib/seaborn examples for generating one of them, e.g., [the confusion matrix].
IX. Code Understanding, Improvement & Debugging
Assistance with understanding, optimizing, translating, and fixing code.
- Explain Code Snippet:
Act as a code explainer. Explain what the following [language, e.g., 'Python', 'SQL', 'R'] code is doing. Provide a step-by-step explanation, describe the overall purpose, and highlight any complex or non-obvious logic: [paste your code here]
- Optimize Code Performance/Readability:
Act as a code optimizer. The following [language, e.g., 'Python', 'SQL'] code is [describe issue, e.g., 'running very slowly on large data', 'hard to understand', 'using too much memory']: [paste your code here]. Suggest improvements to [e.g., 'improve its time complexity', 'reduce memory usage by {target amount or method}', 'improve readability through better variable names and comments', 'simplify the logic for {specific part}']. Provide the optimized code if possible.
- Translate Code Between Languages:
Act as a code translator. Convert the following code snippet from [{source language, e.g., 'Python with pandas'}] to [{target language, e.g., 'SQL for PostgreSQL', 'R with dplyr'}]. Ensure the functionality remains identical: [paste your code here]
- Write Documentation for Code:
Act as a software developer. Generate documentation (e.g., a docstring in Python) for the following [language] function/module. Include a clear description of its purpose, parameters (name, type, description, default value if any), what it returns (type, description), and any exceptions it might raise: [paste your function or code block here]
- Improve Code Readability:
Act as a code reviewer. Improve the readability and maintainability of the following Python code. Focus on aspects like naming conventions (PEP 8 for Python), comments where necessary, breaking down long functions, and overall code structure: [paste your Python code here]
- Format SQL Code:
Act as a SQL formatter. Format the following SQL code for [SQL Dialect, e.g., 'PostgreSQL']. Apply these formatting rules: [e.g., 'convert all reserved keywords to uppercase', 'indent subqueries and CTEs', 'align clauses vertically', 'use consistent spacing around operators']. SQL code: [paste your SQL code here]
- Debug Python Code:
Act as a Python debugger. The following Python code is supposed to [expected function or output, e.g., 'calculate the sum of a list of numbers'] but it’s producing [error message or incorrect behavior, e.g., 'a TypeError: unsupported operand type(s) for +: "int" and "str"' OR 'an incorrect sum of X instead of Y']. Please help me debug it. Code: [paste your Python code here]
- Correct SQL Code Error:
Act as a SQL code corrector. This SQL query for [{your DBMS, e.g., 'PostgreSQL', 'MySQL', 'Oracle'}] is not running correctly. The error I’m getting is: [{paste error message}] (or it’s producing [incorrect result description]). Can you correct the syntax or logic for my DBMS? SQL code: [paste your SQL code here]
X. Specific Tools & Business Intelligence
Prompts for working with specific data science tools and Business Intelligence platforms.
- Generic Library Usage:
Provide a Python code example using [library name, e.g., Pandas, Scikit-learn, Matplotlib] to perform [specific task, e.g., 'merge two DataFrames df1 (on key "ID") and df2 (on key "user id") using an inner join', 'train a KNeighborsClassifier with n_neighbors=5', 'create a scatter plot of column X vs column Y with points colored by column Z'].
- Troubleshooting Power BI Model/DAX:
Act as a Power BI modeler. I’m working on a Power BI project with the following details: Data sources include [e.g., 'SQL Server table Sales', 'Excel file Products']. Relationships are [e.g., 'Sales[ProductID] to Products[ProductID] (many-to-one)']. I am encountering [specific problem, e.g., 'incorrect totals in my DAX measure for "YoY Sales Growth"', 'a circular dependency error when creating a calculated column', 'slow performance on a report page displaying {visual type} with {measure name}']. My DAX for the problematic measure is: [paste DAX measure here]. Can you suggest potential issues or improvements?
XI. Conceptual Understanding & Project Guidance
Prompts for explaining data science concepts, planning projects, and getting career advice.
- Explain Data Science Concepts:
Explain the concept of [{data science concept, e.g., 'p-value', 'cross-validation', 'gradient descent', 'bias-variance tradeoff', 'regularization (L1 vs L2)'}] to [target audience, e.g., 'a five-year-old using an analogy', 'a business stakeholder with no technical background focusing on implications', 'an undergraduate data science student needing technical details'].
- A/B Testing Design:
Act as a statistician. I want to test the hypothesis that [hypothesis, e.g., 'changing the call-to-action button on my landing page from "Sign Up" to "Get Started Free" will increase conversion rate']. Current conversion rate is [X]% from [N] weekly visitors. Please help design an A/B test. Include:
  - Key metrics to track (primary and secondary).
  - How to calculate the required sample size (assuming [desired power, e.g., 80%], [significance level, e.g., alpha=0.05], and [minimum detectable effect, e.g., 2% uplift]).
  - The statistical test to use for comparing results (e.g., chi-squared, z-test for proportions).
  - How long to run the test.
  - How to interpret the results.
- Suggest Relevant Datasets:
Act as a data science career coach. I want to build a portfolio project demonstrating skills in [specific skills or technologies, e.g., 'NLP for sentiment analysis using transformers', 'time series forecasting with Prophet', 'computer vision for image classification with PyTorch'] with the goal of [{project goal, e.g., 'classifying movie reviews', 'predicting stock prices', 'identifying dog breeds'}]. Can you suggest 3-5 relevant, publicly available datasets for this use case? Provide links if possible.
- Identify Edge Cases for a Function:
Act as a software quality assurance engineer. Help me identify potential edge cases and boundary conditions for the following Python function that [function purpose, e.g., 'calculates the average of a list of numbers']. Consider various input types (valid and invalid), empty inputs, very large inputs, inputs with special values (None, NaN), etc. Function: [paste your Python function here]
- Portfolio Project Ideas:
Act as a data science mentor. My background is in [{your background, e.g., 'marketing analytics', 'academic research in biology', 'software engineering with Java'}] and I want to [career goal, e.g., 'transition into a Machine Learning Engineer role', 'specialize in MLOps', 'become a Data Scientist in the healthcare industry']. I need to build 2-3 portfolio projects. Can you suggest specific project ideas that leverage my background, help me achieve my goal, and showcase expertise in [{your key skills to highlight, e.g., 'Python, SQL, model deployment', 'statistical analysis, R, bioinformatics tools', 'scalable systems, cloud platforms like AWS/Azure'}] relevant to [{target industry or company type, e.g., 'tech startups', 'pharmaceutical companies', 'e-commerce businesses'}]?
- Learning Resource Suggestions:
Act as a data science learning advisor. I want to learn more about [{specific topic, e.g., 'Reinforcement Learning from Sutton & Barto', 'advanced SQL for data analysis', 'MLOps best practices for model deployment and monitoring', 'causal inference using DoWhy/CausalML'}]. Please suggest the 3 best specific resources (e.g., books, online courses with links, influential research papers, practical blogs/tutorials). I prefer [specify resource type if any, e.g., 'interactive tutorials with code', 'in-depth video lectures', 'comprehensive textbooks'].
- Compare Time Complexity of Algorithms:
Act as a computer science instructor. Compare the time complexity (Big O notation) of the following two Python algorithms that aim to achieve the same task of [task description, e.g., 'finding if an element exists in a sorted list']. Explain your reasoning for each.
Algorithm 1 ([e.g., linear search]): [paste first Python function]
Algorithm 2 ([e.g., binary search]): [paste second Python function]
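For the A/B Testing Design prompt in this section, the sample-size step is the one most worth verifying by hand. The sketch below uses a common normal-approximation formula for comparing two proportions and only the standard library. The function name, the 10% baseline, and the reading of "2% uplift" as an absolute change of two percentage points are all assumptions; dedicated power-analysis tools may use slightly different formulas and give somewhat different numbers.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(p_baseline, mde_abs, alpha=0.05, power=0.80):
    """Approximate visitors needed per variant for a two-sided test
    comparing two conversion rates (normal approximation)."""
    p1 = p_baseline
    p2 = p_baseline + mde_abs  # assumption: the effect is an absolute change

    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)           # desired power

    # Sum of the two groups' Bernoulli variances.
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / mde_abs ** 2)


# Hypothetical inputs: 10% baseline conversion, +2 percentage points.
n = sample_size_per_group(p_baseline=0.10, mde_abs=0.02)
print(n)
```

Dividing the per-group figure (times the number of variants) by your weekly traffic gives a first estimate of how long the test needs to run, which is the fourth item the prompt asks for.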
XII. Communication & Ethical Considerations
Prompts for effectively communicating data science results and addressing ethical implications.
- Structuring a Report/Presentation:
Outline a structure for a technical report (or a 15-slide presentation) summarizing the findings of a data science project on [project topic, e.g., 'customer churn prediction model development'] for an audience of [e.g., 'technical peers and data science managers', 'non-technical business executives', 'a mixed audience']. Highlight key sections and what each should cover.
- Identifying Potential Bias in Models:
What are potential sources of bias (e.g., sampling bias, measurement bias, algorithmic bias) I should be aware of when building a model to predict [sensitive outcome, e.g., 'loan application approval', 'candidate suitability for a job', 'recidivism risk'] using features like [Feature X (e.g., 'zip code'), Feature Y (e.g., 'education level'), ...]? How can I attempt to measure and mitigate these biases at different stages of the project (data collection, preprocessing, modeling, post-processing)?
- Explaining Model Limitations:
My model for [task, e.g., predicting house prices] achieved [performance metric, e.g., an R-squared of 0.85]. How should I explain the limitations of this model to stakeholders, considering aspects like [e.g., 'data used for training (scope, recency)', 'unseen scenarios', 'potential for drift', 'features not included']?