All files needed for the assignment, including the instructions, have been uploaded below.
Data Transformation and Automation
Project Brief
Scenario
In the highly competitive movie streaming market, your client has asked for help
enriching their data with publicly available data. Starting from a client dataset
that describes a set of movies, you are tasked with scraping publicly available data
from the Internet Movie Database (IMDb), one of the most popular websites for movie
data. You’ll need to transform the scraped data into a structured format and integrate
it with your client’s data to produce an enriched dataset.
Key Tasks
This project has two parts with multiple tasks and separate deliverables for each part.
Read each set of instructions carefully.
PART A: Data Gathering, Transformation, and Enrichment.
Download and unzip the project file, Project 03.zip.
The project zip file includes three files:
• Project_3_Part_A.ipynb
• TopVoted_500_Movies_HTML.txt
• Movies.csv
Perform data gathering using web scraping to enrich your client’s dataset (Movies.csv),
which contains the 500 top-voted movies released between 2018 and 2020. The dataset
includes the following fields:
• movie_id: alphanumeric unique identifier of the title.
• originalTitle: original title, in the original language.
• runtimeMinutes: the movie runtime in minutes.
• description: a short description of the movie.
• ratingCategory: movie rating for empowering families to make informed
movie choices.
• genres: includes up to three genres associated with the title.
Rename the Project_3_Part_A.ipynb Jupyter notebook by adding your last name to the
filename. Edit the code in the notebook to complete the following tasks:
1. Conduct Data Gathering:
• Scrape this IMDb webpage of movies released between 2018 and 2020,
sorted by votes in descending order. Pull movie_id, rank, title, year,
rating, and votes for the top 500 movies, sorted by number of user
votes in descending order.
• Transform the scraped data to a structured format and write it to a CSV
file (name it IMDb_TopVoted.csv).
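The extraction step above might be sketched as follows. The HTML fragment and its tag and class names are hypothetical, invented for illustration; the real markup in TopVoted_500_Movies_HTML.txt has to be inspected first and the pattern adapted to it. The sketch uses only the standard library (re); the course materials may rely on a dedicated HTML parser instead.

```python
import re

# Hypothetical fragment: the tag and class names below are invented for
# illustration and will differ from the real IMDb markup.
SAMPLE_HTML = """
<div class="item">
  <span class="rank">1.</span>
  <a href="/title/tt0000001/">Example Movie One</a>
  <span class="year">(2019)</span>
  <span class="rating">8.4</span>
  <span class="votes">1,234,567</span>
</div>
<div class="item">
  <span class="rank">2.</span>
  <a href="/title/tt0000002/">Example Movie Two</a>
  <span class="year">(2018)</span>
  <span class="rating">7.9</span>
  <span class="votes">987,654</span>
</div>
"""

# One named group per field requested in the brief.
ITEM_PATTERN = re.compile(
    r'<span class="rank">(?P<rank>\d+)\.</span>\s*'
    r'<a href="/title/(?P<movie_id>tt\d+)/">(?P<title>[^<]+)</a>\s*'
    r'<span class="year">\((?P<year>\d{4})\)</span>\s*'
    r'<span class="rating">(?P<rating>[\d.]+)</span>\s*'
    r'<span class="votes">(?P<votes>[\d,]+)</span>'
)


def extract_movies(html):
    """Return one dict of raw strings per movie found in the HTML text."""
    return [match.groupdict() for match in ITEM_PATTERN.finditer(html)]


movies = extract_movies(SAMPLE_HTML)
```

Each dict keeps the values as raw strings (for example, votes still contains thousands separators), so type conversion is left to a later cleaning step.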
Supporting Materials
Web Scraping: Review screens 2 to 12 for how to extract data from a web page.
Errors and Exceptions: Use this resource for how to handle runtime errors with the
try and except statements in Python.
CSV File Reading and Writing: Review this resource for how to write your output
to a CSV file.
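As a minimal sketch of how the last two resources fit together, the example below converts a scraped string to a number inside a try/except block and writes the rows with the standard csv module. The helper names (to_int, write_movies) and the sample row are hypothetical, and an in-memory buffer stands in for the real output file.

```python
import csv
import io

FIELDS = ["movie_id", "rank", "title", "year", "rating", "votes"]


def to_int(text):
    """Convert text such as '1,234,567' to an int; return None on failure."""
    try:
        return int(text.replace(",", ""))
    except (AttributeError, ValueError):
        # AttributeError: text is not a string; ValueError: not a number.
        return None


def write_movies(rows, handle):
    """Write the scraped rows to an open text handle in CSV format."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


# Hypothetical scraped row, invented for illustration.
rows = [
    {"movie_id": "tt0000001", "rank": 1, "title": "Example Movie One",
     "year": 2019, "rating": 8.4, "votes": to_int("1,234,567")},
]

# The buffer stands in for
# open("IMDb_TopVoted.csv", "w", newline="", encoding="utf-8").
buffer = io.StringIO()
write_movies(rows, buffer)
csv_text = buffer.getvalue()
```

Returning None for a value that cannot be parsed lets the scrape continue past a malformed entry; whether to skip, log, or stop on such rows is a design choice the brief leaves open.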
2. Conduct Data Enrichment:
• Import the Movies.csv file to a pandas DataFrame called df1.
• Import the scraped data from the IMDb_TopVoted.csv file to a pandas
DataFrame called df2.
• Apply data cleansing and transformation to convert the column
datatypes in df2.
o For example, movie runtime should be a numeric datatype.
• Enrich the given dataset (df1) by merging it to the scraped data (df2).
• Rearrange the dataset fields to be listed in the following order:
movie_id, rank, title, originalTitle, description, year,
votes, rating, runtimeMinutes, ratingCategory, genres
• Export the enriched dataset to a CSV file:
o Use the following naming convention:
Project_3_Part_A_Lastname.csv
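The enrichment steps above could look roughly like the following pandas sketch. The two in-memory CSV snippets are hypothetical stand-ins for Movies.csv and IMDb_TopVoted.csv, and the raw formats assumed for year ("(2019)") and votes ("1,234,567") are guesses; the cleaning needed depends on what the scrape actually produced. An inner merge on movie_id is also an assumption, since the brief does not name the join type.

```python
import io

import pandas as pd

# Hypothetical stand-ins for Movies.csv and IMDb_TopVoted.csv.
movies_csv = io.StringIO(
    "movie_id,originalTitle,runtimeMinutes,description,ratingCategory,genres\n"
    'tt0000001,Example Movie One,120,A sample description.,PG-13,"Action,Drama"\n'
)
scraped_csv = io.StringIO(
    "movie_id,rank,title,year,rating,votes\n"
    'tt0000001,1,Example Movie One,(2019),8.4,"1,234,567"\n'
)

df1 = pd.read_csv(movies_csv)
df2 = pd.read_csv(scraped_csv)

# Clean df2: pull the four-digit year out of "(2019)" and strip the
# thousands separators from votes, then convert both to numbers.
df2["year"] = pd.to_numeric(
    df2["year"].astype(str).str.extract(r"(\d{4})", expand=False)
)
df2["votes"] = pd.to_numeric(
    df2["votes"].astype(str).str.replace(",", "", regex=False)
)

# Enrich the client data with the scraped columns (join type is an assumption).
enriched = df1.merge(df2, on="movie_id", how="inner")

column_order = [
    "movie_id", "rank", "title", "originalTitle", "description", "year",
    "votes", "rating", "runtimeMinutes", "ratingCategory", "genres",
]
enriched = enriched[column_order]

# Passing a path such as "Project_3_Part_A_Lastname.csv" would write a file;
# without a path, to_csv returns the CSV text.
csv_text = enriched.to_csv(index=False)
```

Checking enriched.dtypes and the row count after the merge is a quick way to confirm that the conversions worked and that no movies were dropped by the join.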
PART B: Automate Data Transformation and Integration.
Use Alteryx to automate the process that you applied in Part A to clean, transform, and
integrate the data.
1. Create an Alteryx workflow to:
a. Import the IMDb_TopVoted.csv dataset you created in Part A.
b. Do the necessary data cleansing and transformation.
c. Import the Movies.csv dataset.
d. Merge the two datasets to obtain the enriched dataset.
e. Sort the enriched dataset by rank in ascending order, and rearrange the
dataset fields to be listed in the following order: movie_id, rank, title,
originalTitle, description, year, votes, rating, runtimeMinutes,
ratingCategory, genres
f. Export the enriched dataset to a CSV file:
• Use the following naming convention:
Project_3_Part_B_Lastname.csv
2. In a Word document, briefly report on the following:
• What data was used to enrich the client’s data?
• Describe the data cleaning and transformation that was implemented.
What to Submit:
PART A: Upload the following 4 files:
• The edited Jupyter notebook in .IPYNB format with annotations
that explain and document your work.
• A copy of the Jupyter notebook in .HTML format.
• CSV file for the scraped data (IMDb_TopVoted_Lastname.csv).
• CSV file for the enriched dataset
(Project_3_Part_A_Lastname.csv).
PART B: Upload the following 3 files:
• Alteryx file for the workflow
(Project_3_Part_B_Lastname.yxmd).
• CSV file for the output enriched dataset
(Project_3_Part_B_Lastname.csv).
• Word document with written description
(Project_3_Part_B_Lastname.doc).
Note: do not submit as a zip file.