Enhance dataset analysis in assignment.ipynb by PARTHDESHMUKH2005 · Pull Request #731 · microsoft/Data-Science-For-Beginners

PARTHDESHMUKH2005 · 2026-01-19T16:13:52Z

Added code to analyze the dataset, including some missing features, column types, and three trip statistics.

Added code to analyze dataset, including shape, column types, and trip statistics.

leestott · 2026-02-17T07:57:07Z

@PARTHDESHMUKH2005 please approve the Microsoft CLA

PARTHDESHMUKH2005 · 2026-02-18T08:10:07Z

Hey, could you give me instructions on how to approve the CLA? I did not find any :(

leestott · 2026-02-27T07:22:25Z

@PARTHDESHMUKH2005 please read the following Contributor License Agreement (CLA). If you agree with the CLA, please reply with the following information.

@microsoft-github-policy-service agree [company="{your company}"]

Options:

(default - no company specified) I have sole ownership of intellectual property rights to my Submissions and I am not making Submissions in the course of work for my employer.
@microsoft-github-policy-service agree

(when company given) I am making Submissions in the course of work for my employer (or my employer has intellectual property rights in my Submissions by contract or applicable law). I have permission from my employer to make Submissions and enter into this Agreement on behalf of my employer. By signing below, the defined term “You” includes me and my employer.
@microsoft-github-policy-service agree company="Microsoft"

Copilot

Pull request overview

This pull request enhances the exploratory data analysis capabilities in the NYC Taxi dataset assignment notebook by adding code to analyze the dataset structure, perform datetime conversions, engineer time-based features, and display trip statistics. The changes aim to provide students with a working example of dataset exploration and feature engineering techniques.

Changes:

Added code cell with dataset analysis including shape, column types, datetime conversions, feature engineering (trip duration, pickup hour/day/month), and statistical summaries
Added output cell showing expected results from the analysis code
Updated file structure (added closing brace on line 187)
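
The per-trip computations listed above can be sketched in plain Python for a single record. This is a minimal illustration rather than the notebook's code: the timestamps are made up, and the standard-library `datetime` calls stand in for the pandas `.dt` accessors used in the PR (`weekday()` follows the same Monday=0 convention as `.dt.dayofweek`).

```python
from datetime import datetime

# Hypothetical single trip; these timestamps are invented for illustration.
pickup = datetime.fromisoformat("2019-01-15 08:30:00")
dropoff = datetime.fromisoformat("2019-01-15 08:45:30")

# Trip duration in minutes, mirroring .dt.total_seconds() / 60 in the PR.
trip_duration_minutes = (dropoff - pickup).total_seconds() / 60

# Time-based features, mirroring .dt.hour, .dt.dayofweek, and .dt.month.
pickup_hour = pickup.hour
pickup_day_of_week = pickup.weekday()  # Monday=0 ... Sunday=6
pickup_month = pickup.month

print(trip_duration_minutes, pickup_hour, pickup_day_of_week, pickup_month)
```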

Comments suppressed due to low confidence (2)

4-Data-Science-Lifecycle/15-analyzing/assignment.ipynb:157

There is a trailing space after the closing quote on this line. This should be removed to maintain clean formatting in the notebook JSON structure.

Average trip duration (minutes): 13.548333333333334\r\n"

4-Data-Science-Lifecycle/15-analyzing/assignment.ipynb:121

These empty string entries in the source array break the logical flow of the code. Line 116 appears between the print statements and the datetime conversion comment, and line 118 appears after the datetime conversion comment. In Jupyter notebook JSON format, each line of code should be a separate string in the source array, but empty strings that don't represent actual newlines or code should be removed. Consider removing these entries or replacing them with proper newline characters if intentional spacing is needed.
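
The effect of empty entries can be illustrated with a small sketch. It assumes the flagged entries are empty strings (`""`) and relies on the notebook format concatenating the `source` entries as-is; the `source` list below is a hypothetical reconstruction, not the PR's actual array.

```python
# Hypothetical `source` array resembling the one under review,
# with one stray empty entry.
source = [
    "print(df.dtypes)\r\n",
    "\r\n",
    "# Convert datetime columns to proper datetime format\r\n",
    "",
    "df['tpep_pickup_datetime'] = pd.to_datetime(df['tpep_pickup_datetime'])\r\n",
]

# The entries are joined without separators, so "" adds nothing to the code.
code = "".join(source)

# Dropping empty strings leaves the joined code unchanged, while "\r\n"
# entries (real blank lines) are kept.
cleaned = [line for line in source if line != ""]
assert "".join(cleaned) == code
print(len(source), len(cleaned))
```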

"\r\n"
"# Convert datetime columns to proper datetime format\r\n"

"df['tpep_pickup_datetime'] = pd.to_datetime(df['tpep_pickup_datetime'])\r\n"
"df['tpep_dropoff_datetime'] = pd.to_datetime(df['tpep_dropoff_datetime'])\r\n"

Copilot · 2026-02-27T07:50:07Z

4-Data-Science-Lifecycle/15-analyzing/assignment.ipynb

+"df['pickup_hour'] = df['tpep_pickup_datetime'].dt.hour\r\n"
+"df['pickup_day_of_week'] = df['tpep_pickup_datetime'].dt.dayofweek\r\n"
+"df['pickup_month'] = df['tpep_pickup_datetime'].dt.month\r\n"
+"#interesting statistics\r\n"


The comment is missing a space after the hash symbol. According to PEP 8 and the conventions used throughout this codebase (see lines 112, 117, 122, 124), comments should have a space after the '#' symbol. Change '#interesting statistics' to '# interesting statistics' or consider a more descriptive comment like '# Display trip statistics' to match the professional tone of other comments in the notebook.

This issue also appears in the following locations of the same file:

line 157

line 116
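
A check like the one described could be sketched as follows. This is an illustrative helper, not a tool used in this repository: `comments_missing_space` is a hypothetical name, and the pattern only covers whole-line comments, not inline ones.

```python
import re

# Matches a line whose comment starts with '#' followed directly by a
# character other than whitespace, '#', or '!' (so '#!' and '##' are skipped).
MISSING_SPACE = re.compile(r"^\s*#[^\s#!]")

def comments_missing_space(lines):
    """Return the 1-based positions of comment lines lacking a space after '#'."""
    return [i for i, line in enumerate(lines, start=1) if MISSING_SPACE.match(line)]

cell_lines = [
    "# Extract time-based features",
    "df['pickup_hour'] = df['tpep_pickup_datetime'].dt.hour",
    "#interesting statistics",
]
print(comments_missing_space(cell_lines))
```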

Suggested change

"#interesting statistics\r\n"

"# Display trip statistics\r\n"

Copilot · 2026-02-27T07:50:07Z

4-Data-Science-Lifecycle/15-analyzing/assignment.ipynb

+"# Basic information about the dataset\r\n"
+print("Dataset shape:", df.shape)\r\n"
+print("\nColumn names and types:")\r\n"
+print(df.dtypes)\r\n"
+"\r\n"
+"# Convert datetime columns to proper datetime format\r\n"
+
+"df['tpep_pickup_datetime'] = pd.to_datetime(df['tpep_pickup_datetime'])\r\n"
+"df['tpep_dropoff_datetime'] = pd.to_datetime(df['tpep_dropoff_datetime'])\r\n"
+
+"# Calculate trip duration in minutes\r\n"
+"df['trip_duration_minutes'] = (df['tpep_dropoff_datetime'] - df['tpep_pickup_datetime']).dt.total_seconds() / 60\r\n"
+"# Extract time-based features\r\n"
+"df['pickup_hour'] = df['tpep_pickup_datetime'].dt.hour\r\n"
+"df['pickup_day_of_week'] = df['tpep_pickup_datetime'].dt.dayofweek\r\n"
+"df['pickup_month'] = df['tpep_pickup_datetime'].dt.month\r\n"
+"#interesting statistics\r\n"
+"print("\nAverage trip distance:", df['trip_distance'].mean())\r\n"
+"print("Average fare amount:", df['fare_amount'].mean())\r\n"
+"print("Average trip duration (minutes):", df['trip_duration_minutes'].mean())\r\n"


The string literals in this code use curly/smart quotes (“ ”) instead of straight quotes (" "). Python requires straight quotes for string literals. All print statements and comments need to use straight double quotes (") or single quotes (') instead of the curly quotes currently used. This will cause a SyntaxError when the cell is executed.
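
The failure mode can be reproduced with a short sketch. The helper name `compiles` and the sample statement are made up for illustration; the curly quotes are written as escapes so the example itself stays valid Python.

```python
# Source text with curly double quotes around the string literal.
bad_source = "print(\u201cAverage fare amount:\u201d, 12.2675)"

def compiles(source):
    """Return True if `source` parses as Python, False on SyntaxError."""
    try:
        compile(source, "<cell>", "exec")
        return True
    except SyntaxError:
        return False

# Replace curly double and single quotes with their straight equivalents.
STRAIGHT = str.maketrans({"\u201c": '"', "\u201d": '"', "\u2018": "'", "\u2019": "'"})
fixed_source = bad_source.translate(STRAIGHT)

print(compiles(bad_source), compiles(fixed_source))
```

A blanket replacement like this would also alter curly quotes that appear intentionally inside text, so it is best treated as a starting point and reviewed by hand.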

Copilot · 2026-02-27T07:50:08Z

4-Data-Science-Lifecycle/15-analyzing/assignment.ipynb

+"Dataset shape: (200, 18)\r\n
+Column names and types:\r\n
+VendorID                  float64\r\n
+tpep_pickup_datetime       object\r\n
+tpep_dropoff_datetime      object\r\n
+passenger_count           float64\r\n
+trip_distance             float64\r\n
+RatecodeID                float64\r\n
+store_and_fwd_flag         object\r\n
+PULocationID                int64\r\n
+DOLocationID                int64\r\n
+payment_type              float64\r\n
+fare_amount               float64\r\n
+extra                     float64\r\n
+mta_tax                   float64\r\n
+tip_amount                float64\r\n
+tolls_amount              float64\r\n
+improvement_surcharge     float64\r\n
+total_amount              float64\r\n
+congestion_surcharge      float64\r\n
+dtype: object\r\n"
+"Average trip distance: 2.9131500000000003\r\n
+Average fare amount: 12.2675\r\n
+Average trip duration (minutes): 13.548333333333334\r\n" 
+],


The output cell format is incorrect for Jupyter notebook JSON. The outputs array should contain dictionaries with proper structure (output_type, name, text fields), not raw strings. The current format with strings like "Dataset shape: (200, 18)\r\n" will not render correctly in Jupyter. The output should follow the standard Jupyter notebook format with each output item being a dictionary containing output_type (e.g., "stream"), name (e.g., "stdout"), and text as an array of strings.

Suggested change

"Dataset shape: (200, 18)\r\n
Column names and types:\r\n
VendorID                  float64\r\n
tpep_pickup_datetime       object\r\n
tpep_dropoff_datetime      object\r\n
passenger_count           float64\r\n
trip_distance             float64\r\n
RatecodeID                float64\r\n
store_and_fwd_flag         object\r\n
PULocationID                int64\r\n
DOLocationID                int64\r\n
payment_type              float64\r\n
fare_amount               float64\r\n
extra                     float64\r\n
mta_tax                   float64\r\n
tip_amount                float64\r\n
tolls_amount              float64\r\n
improvement_surcharge     float64\r\n
total_amount              float64\r\n
congestion_surcharge      float64\r\n
dtype: object\r\n"
"Average trip distance: 2.9131500000000003\r\n
Average fare amount: 12.2675\r\n
Average trip duration (minutes): 13.548333333333334\r\n"
],

{
  "output_type": "stream",
  "name": "stdout",
  "text": [
    "Dataset shape: (200, 18)\r\n",
    "Column names and types:\r\n",
    "VendorID                  float64\r\n",
    "tpep_pickup_datetime       object\r\n",
    "tpep_dropoff_datetime      object\r\n",
    "passenger_count           float64\r\n",
    "trip_distance             float64\r\n",
    "RatecodeID                float64\r\n",
    "store_and_fwd_flag         object\r\n",
    "PULocationID                int64\r\n",
    "DOLocationID                int64\r\n",
    "payment_type              float64\r\n",
    "fare_amount               float64\r\n",
    "extra                     float64\r\n",
    "mta_tax                   float64\r\n",
    "tip_amount                float64\r\n",
    "tolls_amount              float64\r\n",
    "improvement_surcharge     float64\r\n",
    "total_amount              float64\r\n",
    "congestion_surcharge      float64\r\n",
    "dtype: object\r\n",
    "Average trip distance: 2.9131500000000003\r\n",
    "Average fare amount: 12.2675\r\n",
    "Average trip duration (minutes): 13.548333333333334\r\n"
  ]
}
],
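
One way to avoid hand-editing the outputs array is to build the entry in Python and serialize it with `json`. The sketch below is an assumption-laden illustration, not how the notebook was produced: `captured` is a hypothetical variable holding the printed text, with values copied from the output quoted above.

```python
import json

# Captured stdout as one string (hypothetical variable for illustration).
captured = (
    "Average trip distance: 2.9131500000000003\n"
    "Average fare amount: 12.2675\n"
    "Average trip duration (minutes): 13.548333333333334\n"
)

# A stream output is a dictionary with output_type, name, and text fields;
# text is stored here as a list of lines, each keeping its newline.
stream_output = {
    "output_type": "stream",
    "name": "stdout",
    "text": captured.splitlines(keepends=True),
}

outputs = [stream_output]
print(json.dumps(outputs, indent=1))
```

In practice, re-running the cell in Jupyter and saving the notebook produces this structure automatically, which is usually safer than editing the JSON by hand.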

leestott · 2026-02-27T08:15:57Z

@copilot open a new pull request to apply changes based on the comments in this thread

leestott · 2026-02-27T08:18:26Z

Closing this as you're trying to submit your assignment to the master blank assignment.

Enhance dataset analysis in assignment.ipynb

c1fbbf4

Added code to analyze dataset, including shape, column types, and trip statistics.

artur19simonyan81-rgb approved these changes Jan 21, 2026

leestott requested a review from Copilot February 27, 2026 07:46

Copilot started reviewing on behalf of leestott February 27, 2026 07:47 View session

Copilot AI reviewed Feb 27, 2026

leestott closed this Feb 27, 2026

PARTHDESHMUKH2005 wants to merge 1 commit into microsoft:main from PARTHDESHMUKH2005:patch-1
