Enhance dataset analysis in assignment.ipynb #731
PARTHDESHMUKH2005 wants to merge 1 commit into microsoft:main
Added code to analyze dataset, including shape, column types, and trip statistics.
@PARTHDESHMUKH2005 please approve the Microsoft CLA
hey, could you give me the instructions on how to approve the CLA, I did not find any :(
@PARTHDESHMUKH2005 please read the following Contributor License Agreement (CLA). If you agree with the CLA, please reply with the following information.
@microsoft-github-policy-service agree [company="{your company}"]
(default - no company specified) I have sole ownership of intellectual property rights to my Submissions and I am not making Submissions in the course of work for my employer.
Pull request overview
This pull request enhances the exploratory data analysis capabilities in the NYC Taxi dataset assignment notebook by adding code to analyze the dataset structure, perform datetime conversions, engineer time-based features, and display trip statistics. The changes aim to provide students with a working example of dataset exploration and feature engineering techniques.
Changes:
- Added code cell with dataset analysis including shape, column types, datetime conversions, feature engineering (trip duration, pickup hour/day/month), and statistical summaries
- Added output cell showing expected results from the analysis code
- Updated file structure (added closing brace on line 187)
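The analysis steps listed above can be sketched roughly as follows. This is a minimal sketch, not the PR's actual cell: the three-row DataFrame is hypothetical sample data, and only the column names are taken from the diff in this thread.

```python
import pandas as pd

# Hypothetical three-row sample standing in for the NYC Taxi dataset.
df = pd.DataFrame(
    {
        "tpep_pickup_datetime": [
            "2019-07-15 16:30:00",
            "2019-07-15 08:00:00",
            "2019-07-16 23:50:00",
        ],
        "tpep_dropoff_datetime": [
            "2019-07-15 16:45:00",
            "2019-07-15 08:10:00",
            "2019-07-17 00:10:00",
        ],
        "trip_distance": [2.0, 1.0, 6.0],
        "fare_amount": [10.0, 6.0, 20.0],
    }
)

# Basic information about the dataset
print("Dataset shape:", df.shape)
print("\nColumn names and types:")
print(df.dtypes)

# Convert datetime columns to proper datetime format
df["tpep_pickup_datetime"] = pd.to_datetime(df["tpep_pickup_datetime"])
df["tpep_dropoff_datetime"] = pd.to_datetime(df["tpep_dropoff_datetime"])

# Calculate trip duration in minutes
duration = df["tpep_dropoff_datetime"] - df["tpep_pickup_datetime"]
df["trip_duration_minutes"] = duration.dt.total_seconds() / 60

# Extract time-based features (dayofweek: Monday is 0)
df["pickup_hour"] = df["tpep_pickup_datetime"].dt.hour
df["pickup_day_of_week"] = df["tpep_pickup_datetime"].dt.dayofweek
df["pickup_month"] = df["tpep_pickup_datetime"].dt.month

# Display trip statistics
print("\nAverage trip distance:", df["trip_distance"].mean())
print("Average fare amount:", df["fare_amount"].mean())
print("Average trip duration (minutes):", df["trip_duration_minutes"].mean())
```

The sample includes a trip that crosses midnight to show that subtracting the two datetime columns handles the date change without extra logic.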
Comments suppressed due to low confidence (2)
4-Data-Science-Lifecycle/15-analyzing/assignment.ipynb:157
- There is a trailing space after the closing quote on this line. This should be removed to maintain clean formatting in the notebook JSON structure.
Average trip duration (minutes): 13.548333333333334\r\n"
4-Data-Science-Lifecycle/15-analyzing/assignment.ipynb:121
- These empty string entries in the source array break the logical flow of the code. Line 116 appears between the print statements and the datetime conversion comment, and line 118 appears after the datetime conversion comment. In Jupyter notebook JSON format, each line of code should be a separate string in the source array, but empty strings that don't represent actual newlines or code should be removed. Consider removing these entries or replacing them with proper newline characters if intentional spacing is needed.
"\r\n"
"# Convert datetime columns to proper datetime format\r\n"
"df['tpep_pickup_datetime'] = pd.to_datetime(df['tpep_pickup_datetime'])\r\n"
"df['tpep_dropoff_datetime'] = pd.to_datetime(df['tpep_dropoff_datetime'])\r\n"
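For context on the source array discussed above: a notebook cell's source entries are concatenated to form the cell text, so an entry that holds only a line ending shows up as a blank line. A minimal sketch with the standard library, using a hypothetical five-entry array modeled on the snippet; whether to keep or drop such entries is a style decision.

```python
# Hypothetical source array modeled on the quoted snippet.
source = [
    "# Basic information about the dataset\r\n",
    "rows = 3\r\n",
    "\r\n",
    "# Convert datetime columns to proper datetime format\r\n",
    "converted = True\r\n",
]

# The cell text is the plain concatenation of the entries.
cell_text = "".join(source)
lines = cell_text.splitlines()

# Indices of lines that contain nothing but whitespace.
blank_lines = [i for i, line in enumerate(lines) if not line.strip()]

# Variant with whitespace-only entries removed.
compact = [entry for entry in source if entry.strip()]

print("blank line indices:", blank_lines)
print("entries before/after:", len(source), len(compact))
```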
"df['pickup_hour'] = df['tpep_pickup_datetime'].dt.hour\r\n"
"df['pickup_day_of_week'] = df['tpep_pickup_datetime'].dt.dayofweek\r\n"
"df['pickup_month'] = df['tpep_pickup_datetime'].dt.month\r\n"
"#interesting statistics\r\n"
The comment is missing a space after the hash symbol. According to PEP 8 and the conventions used throughout this codebase (see lines 112, 117, 122, 124), comments should have a space after the '#' symbol. Change '#interesting statistics' to '# interesting statistics' or consider a more descriptive comment like '# Display trip statistics' to match the professional tone of other comments in the notebook.
This issue also appears in the following locations of the same file:
- line 157
- line 116
Suggested change:
- "#interesting statistics\r\n"
+ "# Display trip statistics\r\n"
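The spacing rule from this comment can also be applied mechanically. A minimal sketch, assuming only whole-line comments need fixing; `fix_comment_spacing` is a hypothetical helper, not part of the notebook or any linter.

```python
import re

# '#' at the start of a line (after optional indentation) that is not
# followed by whitespace, another '#', a shebang '!', or the end of the line.
_TIGHT_COMMENT = re.compile(r"^(\s*)#(?![\s#!]|$)")


def fix_comment_spacing(line: str) -> str:
    """Insert a single space after the leading '#' of a whole-line comment."""
    return _TIGHT_COMMENT.sub(r"\1# ", line)


print(repr(fix_comment_spacing("#interesting statistics\r\n")))
```

Inline comments that follow code on the same line are left untouched, since telling them apart from a `#` inside a string literal would need a real tokenizer.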
"# Basic information about the dataset\r\n"
print("Dataset shape:", df.shape)\r\n"
print("\nColumn names and types:")\r\n"
print(df.dtypes)\r\n"
"\r\n"
"# Convert datetime columns to proper datetime format\r\n"

"df['tpep_pickup_datetime'] = pd.to_datetime(df['tpep_pickup_datetime'])\r\n"
"df['tpep_dropoff_datetime'] = pd.to_datetime(df['tpep_dropoff_datetime'])\r\n"

"# Calculate trip duration in minutes\r\n"
"df['trip_duration_minutes'] = (df['tpep_dropoff_datetime'] - df['tpep_pickup_datetime']).dt.total_seconds() / 60\r\n"
"# Extract time-based features\r\n"
"df['pickup_hour'] = df['tpep_pickup_datetime'].dt.hour\r\n"
"df['pickup_day_of_week'] = df['tpep_pickup_datetime'].dt.dayofweek\r\n"
"df['pickup_month'] = df['tpep_pickup_datetime'].dt.month\r\n"
"#interesting statistics\r\n"
"print("\nAverage trip distance:", df['trip_distance'].mean())\r\n"
"print("Average fare amount:", df['fare_amount'].mean())\r\n"
"print("Average trip duration (minutes):", df['trip_duration_minutes'].mean())\r\n"
The string literals in this code use curly/smart quotes (“ ”) instead of straight quotes (" "). Python requires straight quotes for string literals. All print statements and comments need to use straight double quotes (") or single quotes (') instead of the curly quotes currently used. This will cause a SyntaxError when the cell is executed.
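The failure mode described here, and one way to repair it, can be illustrated with the standard library alone. A minimal sketch; `straighten_quotes` and `is_valid_python` are hypothetical helper names, and the sample line is modeled on the print statements in the diff.

```python
# Map the four common curly quote characters to their straight equivalents.
SMART_QUOTES = str.maketrans(
    {"\u201c": '"', "\u201d": '"', "\u2018": "'", "\u2019": "'"}
)


def straighten_quotes(code: str) -> str:
    """Replace curly quotes with straight ones throughout the text."""
    return code.translate(SMART_QUOTES)


def is_valid_python(code: str) -> bool:
    """Report whether the text compiles as Python source."""
    try:
        compile(code, "<cell>", "exec")
    except SyntaxError:
        return False
    return True


broken = "print(\u201cDataset shape:\u201d, 3)\n"
fixed = straighten_quotes(broken)

print(is_valid_python(broken), is_valid_python(fixed))
```

The translation is deliberately blunt: it also rewrites curly quotes that appear inside string literals on purpose, so it suits a one-off cleanup rather than an automatic step.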
"Dataset shape: (200, 18)\r\n
Column names and types:\r\n
VendorID float64\r\n
tpep_pickup_datetime object\r\n
tpep_dropoff_datetime object\r\n
passenger_count float64\r\n
trip_distance float64\r\n
RatecodeID float64\r\n
store_and_fwd_flag object\r\n
PULocationID int64\r\n
DOLocationID int64\r\n
payment_type float64\r\n
fare_amount float64\r\n
extra float64\r\n
mta_tax float64\r\n
tip_amount float64\r\n
tolls_amount float64\r\n
improvement_surcharge float64\r\n
total_amount float64\r\n
congestion_surcharge float64\r\n
dtype: object\r\n"
"Average trip distance: 2.9131500000000003\r\n
Average fare amount: 12.2675\r\n
Average trip duration (minutes): 13.548333333333334\r\n"
],
The output cell format is incorrect for Jupyter notebook JSON. The outputs array should contain dictionaries with proper structure (output_type, name, text fields), not raw strings. The current format with strings like "Dataset shape: (200, 18)\r\n" will not render correctly in Jupyter. The output should follow the standard Jupyter notebook format with each output item being a dictionary containing output_type (e.g., "stream"), name (e.g., "stdout"), and text as an array of strings.
Suggested change (replaces the raw output strings quoted above):
{
  "output_type": "stream",
  "name": "stdout",
  "text": [
    "Dataset shape: (200, 18)\r\n",
    "Column names and types:\r\n",
    "VendorID float64\r\n",
    "tpep_pickup_datetime object\r\n",
    "tpep_dropoff_datetime object\r\n",
    "passenger_count float64\r\n",
    "trip_distance float64\r\n",
    "RatecodeID float64\r\n",
    "store_and_fwd_flag object\r\n",
    "PULocationID int64\r\n",
    "DOLocationID int64\r\n",
    "payment_type float64\r\n",
    "fare_amount float64\r\n",
    "extra float64\r\n",
    "mta_tax float64\r\n",
    "tip_amount float64\r\n",
    "tolls_amount float64\r\n",
    "improvement_surcharge float64\r\n",
    "total_amount float64\r\n",
    "congestion_surcharge float64\r\n",
    "dtype: object\r\n",
    "Average trip distance: 2.9131500000000003\r\n",
    "Average fare amount: 12.2675\r\n",
    "Average trip duration (minutes): 13.548333333333334\r\n"
  ]
}
],
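An output entry of the shape described in this comment can be built from captured text rather than typed by hand. A minimal sketch using only the standard library; `make_stream_output` is a hypothetical helper, and the field names are the ones named in the comment (`output_type`, `name`, `text`).

```python
import json


def make_stream_output(text: str, name: str = "stdout") -> dict:
    """Wrap printed text as a stream output entry, one string per line."""
    return {
        "output_type": "stream",
        "name": name,
        # keepends=True preserves each line's original line ending.
        "text": text.splitlines(keepends=True),
    }


printed = "Dataset shape: (200, 18)\nAverage fare amount: 12.2675\n"
outputs = [make_stream_output(printed)]

# Serialize the outputs array the way it would sit inside the notebook JSON.
print(json.dumps(outputs, indent=1))
```

In practice, running the cell in Jupyter and saving the notebook produces this structure automatically, which avoids hand-editing the JSON at all.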
@copilot open a new pull request to apply changes based on the comments in this thread
Closing this, as you're trying to submit your assignment to the master blank assignment.
Added code to analyze dataset, added some missing features, column types, and three trip statistics.