[solved]-Assignment Asks Explore Data Quality Fictional Travel Company Assignment Multiple Steps Pl Q39060234
The assignment below asks you to explore data quality for afictional travel company. This assignment has multiple steps, soplease review these instructions carefully
- Review the information about the data.
- Read the questions carefully – they appear in the text andwithin the Open Response Assessment prompts.
- Draft your answers to the questions in the Open ResponseAssessment boxes below. Answers should be in sentence form.
Texpedia, an internet-based travel website, wants to analyze itscustomer data and get on the bandwagon of business analytics withyour help. For example, Texpedia wants to analyze if age andcustomer type (e.g., retail and corporate) explains the amount thata customer spends in traveling (amtspent). You realize that thedata quality issues can have tremendous impact on the results ofany analysis Texpedia may want conduct. Hence you first want toprofile the data that would be potentially used for analytics.
You are given two tables, CUSTOMER_DATA and CUSTOMER_TYPE, afragment of each table is shown below.
The CUSTOMER_DATA table has 1,500 customers; see Figure 1. Afragment of the data is shown below:
The description of different columns is as follows:
- Customer: The unique customer id for eachrecord.
- SSN: Social Security Number for each customer;this should be unique for every customer.
- Birthdate: Two digit month, two digit date,and two digit year for the birth date of every customer.
- Age: Current age of the customer.
- Region: The region that the customer islocated in.
- Region code: A code that is a numericrepresentation for each Region. This corresponds with the valuesfound in the region column.
- CredCardUser: If a customer uses a creditcard, this column is populated with a value of “1”, otherwise“0”.
- Income: The annual income of the customer indollars.
- Purchases: The number of purchases made by thecustomer.
- AmtSpent: The amount of money that a customerhas collectively spent on all purchases in dollars.
- CustomerType: The type of customer, that is,retail or corporate. Retail customers should have a value of “1”and corporate customers should have a value of “2”. These valuescorrespond with the CustomerType column in the CUSTOMER_TYPE table;in other words, this is a “foreign key” attribute.
The description of different columns is as follows:
- Customer: The unique customer id for eachrecord.
- SSN: Social Security Number for each customer;this should be unique for every customer.
- Birthdate: Two digit month, two digit date,and two digit year for the birth date of every customer.
- Age: Current age of the customer.
- Region: The region that the customer islocated in.
- Region code: A code that is a numericrepresentation for each Region. This corresponds with the valuesfound in the region column.
- CredCardUser: If a customer uses a creditcard, this column is populated with a value of “1”, otherwise“0”.
- Income: The annual income of the customer indollars.
- Purchases: The number of purchases made by thecustomer.
- AmtSpent: The amount of money that a customerhas collectively spent on all purchases in dollars.
- CustomerType: The type of customer, that is,retail or corporate. Retail customers should have a value of “1”and corporate customers should have a value of “2”. These valuescorrespond with the CustomerType column in the CUSTOMER_TYPE table;in other words, this is a “foreign key” attribute.
The CUSTOMER_TYPE table has different types of customers; seeFigure 2. Currently there are only two types of customers, retailand corporate.
Figure 1: CUSTOMER_DATA table (truncated) Customer SSN Birthdate Age Amtspent CustomerType 539-84-9599 10/26/44 67 East 62900 4 2080 444-05-4079 01/01/32 67 West 4 23300 418-18-5649 08/17/73 25 East 48700 3990 065-63-3311 08/02/47 51 West 137600 920 059-58-9566 10/03/48 50 East 101400 1000 443-13-8685 03/24/60 East 139700 550 638-89-7231 12/02/43 55 South 50900 3 1400 202-94-6453 11/08/74 24 South 50500 O 266-29-0308 09/28/67 31 151400 910 943858301 07/05/65 33 88300 2 1080 Figure 2: CUSTOMER_TYPE table Description CustomerType Retail Corporate The description of different columns is as follows: • CustomerType: A unique identifier for each type of customer. This is the column being referenced by the CustomerType column in the CUSTOMER_DATA table. • Description: A text description that describes the type of customer. Texpedia Assignment Questions Below are the multiple questions you will need to answer for the assignment. Question 1 You want to get started with analyzing data in the column SSN. The output below indicates different formats in which SSN is specified. Based on the output below, what would your data quality assessment be? Craft your answer in a brief statement and be sure to refer to the figure. Figure 3 Value Count 999-99-9999 1311.0 87.40% 999999999 181.00 12.07% 99999999 7.00 0.47% 9999999 1.00 0.07% Question 2 The output below shows two formats for specifying Birthdate. Based on the presented column analysis for Birthdate, what is your assessment of data quality? Figure 7 Value Count 99/99/99 1482.00 98.80% 9999 18.00 1.20% Question 3 The value column analysis of the column Age is shown in the table below. Based on the value column analysis on Age in the CUSTOMER_DATA table, what is your assessment of data quality? Figure 10 Value Count 1488.00 99.20% -99 12.00 0.80% Question 4 A region (e.g., “East”) is associated with a specific region code (e.g., 2). However, the association between region and region code may not be one- to-one; for example, both “East” and “E” (region) correspond to “2” (region code). A “column correlation analysis” between the columns “Region” and “RegionCode” in the CUSTOMER_DATA table reveals inconsistent usage of region codes in the database. It is apparent from the analysis that the same region code has been encoded differently. For example, region encoded as 1 has been referred to as N on 48 occasions Nth on 18 occasions and North on 280 occasions. Figure 13 below displays a mapping of region codes to regions. How can different encoding of the same information impact the usage of that data? Figure 13 East 368 366 South 19 West North 267 280 1 48 Nrth Question 5 You suspect that the values of “CustomerType” in CUSTOMER_DATA and CUSTOMER_TYPE tables may not be consistent. Based on column content analysis (see Figure 14, Figure 15, and Figure 16), it seems that that 23.87 of the customer types used in CUSTOMER_DATA is not even present in CUSTOMER_TYPE table. Based on the presented column content analysis, how can it impact the usage of this data? Figure 14 Value 50% 0% 25% 75% 100% CUSTOMER_DATA 23.87% 76.13% 76.13% CUSTOMER_TYPE 100% not matching matching Figure 15 Analysis Type CUSTOMER DATA CUSTOMER TYPE SEMatch 76.139 100.00 Not Match 23.87% 0.009 Match 1142 Not Match Rows Figure 16 CustomerType Show transcribed image text Figure 1: CUSTOMER_DATA table (truncated) Customer SSN Birthdate Age Amtspent CustomerType 539-84-9599 10/26/44 67 East 62900 4 2080 444-05-4079 01/01/32 67 West 4 23300 418-18-5649 08/17/73 25 East 48700 3990 065-63-3311 08/02/47 51 West 137600 920 059-58-9566 10/03/48 50 East 101400 1000 443-13-8685 03/24/60 East 139700 550 638-89-7231 12/02/43 55 South 50900 3 1400 202-94-6453 11/08/74 24 South 50500 O 266-29-0308 09/28/67 31 151400 910 943858301 07/05/65 33 88300 2 1080
Figure 2: CUSTOMER_TYPE table Description CustomerType Retail Corporate The description of different columns is as follows: • CustomerType: A unique identifier for each type of customer. This is the column being referenced by the CustomerType column in the CUSTOMER_DATA table. • Description: A text description that describes the type of customer. Texpedia Assignment Questions Below are the multiple questions you will need to answer for the assignment. Question 1 You want to get started with analyzing data in the column SSN. The output below indicates different formats in which SSN is specified. Based on the output below, what would your data quality assessment be? Craft your answer in a brief statement and be sure to refer to the figure.
Figure 3 Value Count 999-99-9999 1311.0 87.40% 999999999 181.00 12.07% 99999999 7.00 0.47% 9999999 1.00 0.07% Question 2 The output below shows two formats for specifying Birthdate. Based on the presented column analysis for Birthdate, what is your assessment of data quality? Figure 7 Value Count 99/99/99 1482.00 98.80% 9999 18.00 1.20%
Question 3 The value column analysis of the column Age is shown in the table below. Based on the value column analysis on Age in the CUSTOMER_DATA table, what is your assessment of data quality? Figure 10 Value Count 1488.00 99.20% -99 12.00 0.80% Question 4 A region (e.g., “East”) is associated with a specific region code (e.g., 2). However, the association between region and region code may not be one- to-one; for example, both “East” and “E” (region) correspond to “2” (region code). A “column correlation analysis” between the columns “Region” and “RegionCode” in the CUSTOMER_DATA table reveals inconsistent usage of region codes in the database. It is apparent from the analysis that the same region code has been encoded differently. For example, region encoded as 1 has been referred to as N on 48 occasions Nth on 18 occasions and North on 280 occasions. Figure 13 below displays a mapping of region codes to regions. How can different encoding of the same information impact the usage of that data? Figure 13
East 368 366 South 19 West North 267 280 1 48 Nrth
Question 5 You suspect that the values of “CustomerType” in CUSTOMER_DATA and CUSTOMER_TYPE tables may not be consistent. Based on column content analysis (see Figure 14, Figure 15, and Figure 16), it seems that that 23.87 of the customer types used in CUSTOMER_DATA is not even present in CUSTOMER_TYPE table. Based on the presented column content analysis, how can it impact the usage of this data? Figure 14 Value 50% 0% 25% 75% 100% CUSTOMER_DATA 23.87% 76.13% 76.13% CUSTOMER_TYPE 100% not matching matching
Figure 15 Analysis Type CUSTOMER DATA CUSTOMER TYPE SEMatch 76.139 100.00 Not Match 23.87% 0.009 Match 1142 Not Match Rows Figure 16 CustomerType
Expert Answer
Answer to The assignment below asks you to explore data quality for a fictional travel company. This assignment has multiple steps… . . .
OR