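Approach and Understanding
Core problem: modeling star ratings from review data
Approach in brief: think of this as review data from an online marketplace such as JD.com or Taobao. Explain how a product's 4.8-star score comes about, and how much a single review of yours affects that rating.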
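- My own habit for big-data problems is to give data cleaning its own chapter (Chapter 4). For the detailed workflow, see the national-award-winning graduate modeling-contest papers shared in the group chat.
- First analyze the three attachments: extract the useful variables, delete records with missing fields, normalize the categorical variables, and so on. Further processing steps are to be updated.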
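The cleaning steps listed above could be sketched roughly as follows. This is only a minimal illustration using Python's standard library (pandas would be the more usual choice for the real files). The `clean_reviews` and `yes_no_to_int` helpers, the `REQUIRED` field list, and the validity checks are assumptions made for the example, not part of these notes. Column names follow the data set definitions in the problem statement, and the sample rows are made up.

```python
import csv
import io

# Fields a row must have to be kept (an illustrative choice, not from the notes).
REQUIRED = ("star_rating", "helpful_votes", "total_votes",
            "review_body", "review_date")


def yes_no_to_int(value):
    """Encode a Y/N flag as 1/0; anything other than Y counts as 0."""
    return 1 if (value or "").strip().upper() == "Y" else 0


def clean_reviews(tsv_text):
    """Parse TSV text, drop incomplete or invalid rows, and encode Y/N flags."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    cleaned = []
    for row in reader:
        # Drop records with a missing or blank required field.
        if any(not (row.get(name) or "").strip() for name in REQUIRED):
            continue
        try:
            star = int(row["star_rating"])
            helpful = int(row["helpful_votes"])
            total = int(row["total_votes"])
        except ValueError:
            continue  # non-numeric rating or vote count
        # Drop rows that contradict the column definitions (assumed checks).
        if not 1 <= star <= 5 or helpful > total:
            continue
        cleaned.append({
            "star_rating": star,
            "helpful_votes": helpful,
            "total_votes": total,
            "vine": yes_no_to_int(row.get("vine")),
            "verified_purchase": yes_no_to_int(row.get("verified_purchase")),
            "review_body": row["review_body"].strip(),
            "review_date": row["review_date"].strip(),
        })
    return cleaned


# Made-up sample rows, only to exercise the function.
sample = (
    "star_rating\thelpful_votes\ttotal_votes\tvine\tverified_purchase"
    "\treview_body\treview_date\n"
    "5\t3\t4\tN\tY\tWorks great\t8/31/2015\n"
    "4\t0\t0\tn\tN\t\t8/30/2015\n"           # blank review_body: dropped
    "six\t1\t1\tN\tY\tBad row\t8/29/2015\n"  # non-integer rating: dropped
)
rows = clean_reviews(sample)
print(len(rows), rows[0]["verified_purchase"])
```

For the actual attachments, the same function could be fed the contents of hair_dryer.tsv, microwave.tsv, and pacifier.tsv. Which fields count as required is a modeling decision each team should justify.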
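- Basic star-rating model.
First select the effective variables from the 15 variables in the data, analyze the distribution of each one, and model the relationship between STAR_RATING (the star_rating column) and the number of votes, the purchase type, the customer source, and so on. Your model may be qualitative as well as quantitative.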
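One way to make "how much does a single review move the score" concrete is a helpfulness-weighted average, compared against group means. The sketch below is only an assumed weighting scheme for illustration: it is not how Amazon computes its displayed rating, and the `weighted_star_rating` and `mean_rating_by` helpers, the `prior_weight` parameter, and the sample reviews are all hypothetical.

```python
from collections import defaultdict


def weighted_star_rating(reviews, prior_weight=1.0):
    """Mean star rating with each review weighted by prior_weight + helpful_votes.

    The weighting is an assumption for illustration: every review counts once,
    plus once more for each helpful vote it received.
    """
    numerator = 0.0
    denominator = 0.0
    for review in reviews:
        weight = prior_weight + review["helpful_votes"]
        numerator += weight * review["star_rating"]
        denominator += weight
    return numerator / denominator if denominator else None


def mean_rating_by(reviews, key):
    """Plain mean star rating for each distinct value of a categorical column."""
    groups = defaultdict(list)
    for review in reviews:
        groups[review[key]].append(review["star_rating"])
    return {value: sum(stars) / len(stars) for value, stars in groups.items()}


# Made-up reviews, only to exercise the functions.
reviews = [
    {"star_rating": 5, "helpful_votes": 0, "verified_purchase": "Y"},
    {"star_rating": 1, "helpful_votes": 3, "verified_purchase": "N"},
    {"star_rating": 4, "helpful_votes": 0, "verified_purchase": "Y"},
]

plain_mean = sum(r["star_rating"] for r in reviews) / len(reviews)
print(round(plain_mean, 2), round(weighted_star_rating(reviews), 2))
print(mean_rating_by(reviews, "verified_purchase"))
```

In this toy sample the one-star review carries three helpful votes, so the weighted score falls well below the plain mean. Comparisons of that kind are one possible starting point for the qualitative discussion.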
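This question ultimately comes down to pattern analysis for the three products. After some simple Excel processing, every team can produce a fair amount of work, so what sets a paper apart is the validity of the mined patterns and the polish of the charts.
- Knowledge-mining model (natural language processing)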
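The second question involves more work. For each product you need to analyze: ① the main indicators that influence the star rating (models such as principal component analysis, hierarchical clustering, or factor analysis); ② how reputation changes over time for each pattern (time series analysis); ③ text-based measures (note: measures, so explicit rules must be defined) that signal whether a product is succeeding or failing. For example: the frequency of certain words in reviews + a certain time-series pattern = the product is about to flop. Question 2 can be split into several separate models.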
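The "word frequency + time-series pattern" rule in ③ could be prototyped along these lines. This is a rough sketch under stated assumptions, not a recommended model: the `NEGATIVE_WORDS` list, the thresholds in `at_risk`, and the helper names are all hypothetical, the m/d/yyyy date format is assumed, and a real solution would need a proper sentiment lexicon and a justified time-series model.

```python
import re
from collections import defaultdict
from datetime import datetime

# Hypothetical keyword list; a real rule would need a justified lexicon.
NEGATIVE_WORDS = {"broke", "broken", "disappointed", "return", "refund"}


def monthly_stats(reviews):
    """Return {(year, month): (mean_star, negative_share)} in date order."""
    buckets = defaultdict(list)
    for review in reviews:
        # Assumes dates look like 8/31/2015 (month/day/year).
        date = datetime.strptime(review["review_date"], "%m/%d/%Y")
        words = set(re.findall(r"[a-z']+", review["review_body"].lower()))
        has_negative = bool(words & NEGATIVE_WORDS)
        buckets[(date.year, date.month)].append(
            (review["star_rating"], has_negative))
    stats = {}
    for month in sorted(buckets):
        items = buckets[month]
        mean_star = sum(star for star, _ in items) / len(items)
        negative_share = sum(flag for _, flag in items) / len(items)
        stats[month] = (mean_star, negative_share)
    return stats


def slope(values):
    """Least-squares slope of values against their positions 0..n-1."""
    n = len(values)
    if n < 2:
        return 0.0
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    numerator = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    denominator = sum((i - x_mean) ** 2 for i in range(n))
    return numerator / denominator


def at_risk(reviews, slope_limit=-0.1, negative_limit=0.3):
    """Flag a product when ratings trend down and negative words are frequent.

    Months with no reviews are skipped, so the slope is per observed month.
    Both thresholds are placeholders, not fitted values.
    """
    stats = list(monthly_stats(reviews).values())
    if not stats:
        return False
    rating_slope = slope([mean_star for mean_star, _ in stats])
    latest_negative_share = stats[-1][1]
    return rating_slope < slope_limit and latest_negative_share > negative_limit


# Made-up reviews, only to exercise the rule.
reviews = [
    {"review_date": "1/5/2015", "star_rating": 5, "review_body": "Works great"},
    {"review_date": "2/9/2015", "star_rating": 3, "review_body": "It is fine"},
    {"review_date": "3/2/2015", "star_rating": 1,
     "review_body": "Broke after a week, want a refund."},
]
print(at_risk(reviews))
```

Keeping the time-based measure (`slope`) and the text-based measure (negative-word share) in separate functions matches the suggestion to build several models for Question 2 and to combine them only in the final rule.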
- Write a one- to two-page letter presenting your text-mining results to the Marketing Director, backed by data (charts).
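M award (Meritorious Winner) dividing line: based on the results of the model analysis, give recommendations for improving a product's reputation or sales.
This problem is relatively hard. Big-data problems are attractive, but because the text variables are so critical, teams with no natural language processing experience may struggle to produce results. Think carefully before choosing it.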
Translation of the Original Problem
Data Description
Original Text
2020 MCM Weekend 2
Problem C: A Wealth of Data
In the online marketplace it created, Amazon provides customers with an opportunity to rate and
review purchases. Individual ratings – called “star ratings” – allow purchasers to express their
level of satisfaction with a product using a scale of 1 (low rated, low satisfaction) to 5 (highly
rated, high satisfaction). Additionally, customers can submit text-based messages – called
“reviews” – that express further opinions and information about the product. Other customers
can submit ratings on these reviews as being helpful or not – called a “helpfulness rating” –
towards assisting their own product purchasing decision. Companies use these data to gain
insights into the markets in which they participate, the timing of that participation, and the
potential success of product design feature choices.
Sunshine Company is planning to introduce and sell three new products in the online
marketplace: a microwave oven, a baby pacifier, and a hair dryer. They have hired your team as
consultants to identify key patterns, relationships, measures, and parameters in past
customer-supplied ratings and reviews associated with other competing products to 1) inform their online
sales strategy and 2) identify potentially important design features that would enhance product
desirability. Sunshine Company has used data to inform sales strategies in the past, but they have
not previously used this particular combination and type of data. Of particular interest to
Sunshine Company are time-based patterns in these data, and whether they interact in ways that
will help the company craft successful products.
To assist you, Sunshine’s data center has provided you with three data files for this project:
hair_dryer.tsv, microwave.tsv, and pacifier.tsv. These data represent customer-supplied
ratings and reviews for microwave ovens, baby pacifiers, and hair dryers sold in the Amazon
marketplace over the time period(s) indicated in the data. A glossary of data label definitions is
provided as well. THE DATA FILES PROVIDED CONTAIN THE ONLY DATA YOU
SHOULD USE FOR THIS PROBLEM.
Requirements
- Analyze the three product data sets provided to identify, describe, and support with
mathematical evidence, meaningful quantitative and/or qualitative patterns, relationships,
measures, and parameters within and between star ratings, reviews, and helpfulness ratings that
will help Sunshine Company succeed in their three new online marketplace product offerings.
- Use your analysis to address the following specific questions and requests from the Sunshine
Company Marketing Director:
a. Identify data measures based on ratings and reviews that are most informative for
Sunshine Company to track, once their three products are placed on sale in the online
marketplace.
b. Identify and discuss time-based measures and patterns within each data set that might
suggest that a product’s reputation is increasing or decreasing in the online marketplace.
c. Determine combinations of text-based measure(s) and ratings-based measures that best
indicate a potentially successful or failing product.
d. Do specific star ratings incite more reviews? For example, are customers more likely to
write some type of review after seeing a series of low star ratings?
e. Are specific quality descriptors of text-based reviews such as ‘enthusiastic’,
‘disappointed’, and others, strongly associated with rating levels?
- Write a one- to two-page letter to the Marketing Director of Sunshine Company summarizing
your team’s analysis and results. Include specific justification(s) for the result that your team
most confidently recommends to the Marketing Director.
Your submission should consist of:
One-page Summary Sheet
Table of Contents
One- to Two-page Letter
Your solution of no more than 20 pages, for a maximum of 24 pages with your summary
sheet, table of contents, and two-page letter.
Note: Reference List and any appendices do not count toward the page limit and should appear
after your completed solution. You should not make use of unauthorized images and materials
whose use is restricted by copyright laws. Ensure you cite the sources for your ideas and the
materials used in your report.
Glossary
Helpfulness Rating: an indication of how valuable a particular product review is when
making a decision whether or not to purchase that product.
Pacifier: a rubber or plastic soothing device, often nipple shaped, given to a baby to suck
or bite on.
Review: a written evaluation of a product.
Star Rating: a score given in a system that allows people to rate a product with a number
of stars.
Attachments: The Problem Datasets
Problem_C_Data.zip
The three data sets provided contain product user ratings and reviews extracted from the
Amazon Customer Reviews Dataset thru Amazon Simple Storage Service (Amazon S3).
hair_dryer.tsv
microwave.tsv
pacifier.tsv
Data Set Definitions: Each row represents data partitioned into the following columns.
● marketplace (string): 2 letter country code of the marketplace where the review was
written.
● customer_id (string): Random identifier that can be used to aggregate reviews written by
a single author.
● review_id (string): The unique ID of the review.
● product_id (string): The unique Product ID the review pertains to.
● product_parent (string): Random identifier that can be used to aggregate reviews for the
same product.
● product_title (string): Title of the product.
● product_category (string): The major consumer category for the product.
● star_rating (int): The 1-5 star rating of the review.
● helpful_votes (int): Number of helpful votes.
● total_votes (int): Number of total votes the review received.
● vine (string): Customers are invited to become Amazon Vine Voices based on the trust
that they have earned in the Amazon community for writing accurate and insightful
reviews. Amazon provides Amazon Vine members with free copies of products that have
been submitted to the program by vendors. Amazon doesn’t influence the opinions of
Amazon Vine members, nor do they modify or edit reviews.
● verified_purchase (string): A “Y” indicates Amazon verified that the person writing the
review purchased the product at Amazon and didn’t receive the product at a deep
discount.
● review_headline (string): The title of the review.
● review_body (string): The review text.
● review_date (bigint): The date the review was written.