Category: Python


    Assignment: Halloween Candy
    Name:
    Overview
    In this assignment, your job will be to clean and wrangle data from a survey of Halloween candy to prepare it for a machine learning project.
    Once you have completed all the tasks in the notebook, save your notebook as halloween_candy, and verify that all the tests pass in CodeGrade.
For those of you who are more comfortable with using Pandas, I have also provided extra optional data analysis questions at the end of the assignment for more practice.
    Data Set
    The data set that we will be using is the 2017 Halloween Candy Hierarchy data set as discussed in this boingboing article. You can also read more about the data in the Science Creative Quarterly.
    The following are the rating instructions from the survey:
Basically, consider that feeling you get when you receive this item in your Halloween haul. Does it make you really happy (JOY)? Or is it something that you automatically place in the junk pile (DESPAIR)? MEH for indifference, and you can leave blank if you have no idea what the item is.
Note that the original data set has been slightly altered from its original state, and if you wanted to perform any analysis for future projects, you would need to download the data directly from the links above.
This data is a great example of a messy data set, especially since respondents were allowed to enter free text for a number of the fields. Also, note that some of the comments in the file might be considered inappropriate by some readers, but cleaning this type of data is normal in a lot of data science projects.
    Note
    Show Work
Remember that you must show your work. Student submissions are spot checked manually throughout the term to verify that answers are not hard coded from looking only at the file or at CodeGrade's expected output. If this is seen, the student's answer will be manually marked wrong and their grade will be changed to reflect this.
    For example, if the answer to Q1, the mean of a specific column, is 22:
    # correct way
    Q1 = df['column_name'].mean()
    # incorrect way
    Q1 = 22
    Our End Goal
Our end goal for this project is to clean the data so that we could then create a machine learning model. We want to see if we can predict a person's gender based purely on their candy preferences. You will not be creating a model for this assignment, however, only cleaning the data. The results of the models that I used after cleaning the data are provided at the end of this notebook.
    Initial Import & Exploration
    In [18]:
# initial imports
import pandas as pd
import numpy as np

# Do not change this option; this allows the CodeGrade auto grading to function correctly
pd.set_option('display.max_columns', 20)
Let's start by importing our data and creating a DataFrame called candy. We need to include encoding='iso-8859-1' during the import because there are special characters in the data that Pandas doesn't recognize. This happens a lot when importing data where the public is able to type in answers, especially if there are foreign language characters included. The default encoding for Pandas is utf-8, so changing the encoding allows Pandas to recognize those special characters.
    Run the following code, with the encoding argument, and it should import correctly.
    In [19]:
# read_csv with iso-8859-1 encoding; using latin-1 would also work here
candy_full = pd.read_csv('candy.csv', encoding='iso-8859-1')
# copy to a new DataFrame so that we have a copy of the original import if needed
candy = candy_full.copy()
Let's take a brief look at the data by using head().
    In [20]:
# first five rows
candy.head()
Out[20]:
[head() output truncated: the first five rows, running from Internal ID and the demographic questions (Q1-Q5) through the Q6 candy-rating columns to the Q12 media columns and Click Coordinates (x, y)]
    5 rows × 120 columns
    Next, run the following code to see information about the DataFrame.
    In [21]:
# check info about the DataFrame
candy.info()
    RangeIndex: 2479 entries, 0 to 2478
    Columns: 120 entries, Internal ID to Click Coordinates (x, y)
    dtypes: float64(4), int64(1), object(115)
    memory usage: 2.3+ MB
    Notice that this did not print the columns as you might be used to seeing. According to the Pandas documentation: “If the DataFrame has more than max_cols columns, the truncated output is used. By default, the setting in pandas.options.display.max_info_columns is used.”
    We can make the columns display by setting the max_cols argument equal to the number of columns in the data set.
    In [22]:
# check info, set max_cols
candy.info(max_cols=120)
    RangeIndex: 2479 entries, 0 to 2478
    Data columns (total 120 columns):
 #    Column                                                                                 Non-Null Count  Dtype
---   ------                                                                                 --------------  -----
 0    Internal ID                                                                            2479 non-null   int64
 1    Q1: GOING OUT?                                                                         2368 non-null   object
 2    Q2: GENDER                                                                             2437 non-null   object
 3    Q3: AGE                                                                                2394 non-null   object
 4    Q4: COUNTRY                                                                            2414 non-null   object
 5    Q5: STATE, PROVINCE, COUNTY, ETC                                                       2377 non-null   object
 6    Q6 | 100 Grand Bar                                                                     1728 non-null   object
 7    Q6 | Anonymous brown globs that come in black and orange wrappers (a.k.a. Mary Janes)  1741 non-null   object
 8    Q6 | Any full-sized candy bar                                                          1803 non-null   object
 9    Q6 | Black Jacks                                                                       1517 non-null   object
 10   Q6 | Bonkers (the candy)                                                               1482 non-null   object
 11   Q6 | Bonkers (the board game)                                                          1469 non-null   object
 12   Q6 | Bottle Caps                                                                       1709 non-null   object
 13   Q6 | Box’o’Raisins                                                                     1787 non-null   object
 14   Q6 | Broken glow stick                                                                 1767 non-null   object
 15   Q6 | Butterfinger                                                                      1793 non-null   object
 16   Q6 | Cadbury Creme Eggs                                                                1792 non-null   object
 17   Q6 | Candy Corn                                                                        1796 non-null   object
 18   Q6 | Candy that is clearly just the stuff given out for free at restaurants            1784 non-null   object
 19   Q6 | Caramellos                                                                        1723 non-null   object
 20   Q6 | Cash, or other forms of legal tender                                              1795 non-null   object
 21   Q6 | Chardonnay                                                                        1732 non-null   object
 22   Q6 | Chick-o-Sticks (we donÕt know what that is)                                       1528 non-null   object
 23   Q6 | Chiclets                                                                          1764 non-null   object
 24   Q6 | Coffee Crisp                                                                      1621 non-null   object
 25   Q6 | Creepy Religious comics/Chick Tracts                                              1771 non-null   object
 26   Q6 | Dental paraphenalia                                                               1783 non-null   object
 27   Q6 | Dots                                                                              1746 non-null   object
 28   Q6 | Dove Bars                                                                         1773 non-null   object
 29   Q6 | Fuzzy Peaches                                                                     1652 non-null   object
 30   Q6 | Generic Brand Acetaminophen                                                       1744 non-null   object
 31   Q6 | Glow sticks                                                                       1778 non-null   object
 32   Q6 | Goo Goo Clusters                                                                  1595 non-null   object
 33   Q6 | Good N’ Plenty                                                                    1741 non-null   object
 34   Q6 | Gum from baseball cards                                                           1759 non-null   object
 35   Q6 | Gummy Bears straight up                                                           1778 non-null   object
 36   Q6 | Hard Candy                                                                        1780 non-null   object
 37   Q6 | Healthy Fruit                                                                     1781 non-null   object
 38   Q6 | Heath Bar                                                                         1763 non-null   object
 39   Q6 | Hershey’s Dark Chocolate                                                          1802 non-null   object
 40   Q6 | HersheyÕs Milk Chocolate                                                          1803 non-null   object
 41   Q6 | Hershey’s Kisses                                                                  1797 non-null   object
 42   Q6 | Hugs (actual physical hugs)                                                       1762 non-null   object
 43   Q6 | Jolly Rancher (bad flavor)                                                        1781 non-null   object
 44   Q6 | Jolly Ranchers (good flavor)                                                      1780 non-null   object
 45   Q6 | JoyJoy (Mit Iodine!)                                                              1449 non-null   object
 46   Q6 | Junior Mints                                                                      1777 non-null   object
 47   Q6 | Senior Mints                                                                      1532 non-null   object
 48   Q6 | Kale smoothie                                                                     1731 non-null   object
 49   Q6 | Kinder Happy Hippo                                                                1528 non-null   object
 50   Q6 | Kit Kat                                                                           1801 non-null   object
 51   Q6 | LaffyTaffy                                                                        1739 non-null   object
 52   Q6 | LemonHeads                                                                        1745 non-null   object
 53   Q6 | Licorice (not black)                                                              1789 non-null   object
 54   Q6 | Licorice (yes black)                                                              1790 non-null   object
 55   Q6 | Lindt Truffle                                                                     1757 non-null   object
 56   Q6 | Lollipops                                                                         1784 non-null   object
 57   Q6 | Mars                                                                              1750 non-null   object
 58   Q6 | Maynards                                                                          1450 non-null   object
 59   Q6 | Mike and Ike                                                                      1746 non-null   object
 60   Q6 | Milk Duds                                                                         1782 non-null   object
 61   Q6 | Milky Way                                                                         1787 non-null   object
 62   Q6 | Regular M&Ms                                                                      1800 non-null   object
 63   Q6 | Peanut M&MÕs                                                                      1804 non-null   object
 64   Q6 | Blue M&M’s                                                                        1748 non-null   object
 65   Q6 | Red M&M’s                                                                         1746 non-null   object
 66   Q6 | Green Party M&M’s                                                                 1711 non-null   object
 67   Q6 | Independent M&M’s                                                                 1662 non-null   object
 68   Q6 | Abstained from M&M’ing.                                                           1532 non-null   object
 69   Q6 | Minibags of chips                                                                 1751 non-null   object
 70   Q6 | Mint Kisses                                                                       1699 non-null   object
 71   Q6 | Mint Juleps                                                                       1664 non-null   object
 72   Q6 | Mr. Goodbar                                                                       1735 non-null   object
 73   Q6 | Necco Wafers                                                                      1731 non-null   object
 74   Q6 | Nerds                                                                             1752 non-null   object
 75   Q6 | Nestle Crunch                                                                     1777 non-null   object
 76   Q6 | Now’n’Laters                                                                      1657 non-null   object
 77   Q6 | Peeps                                                                             1765 non-null   object
 78   Q6 | Pencils                                                                           1766 non-null   object
 79   Q6 | Pixy Stix                                                                         1753 non-null   object
 80   Q6 | Real Housewives of Orange County Season 9 Blue-Ray                                1722 non-null   object
 81   Q6 | ReeseÕs Peanut Butter Cups                                                        1796 non-null   object
 82   Q6 | Reese’s Pieces                                                                    1784 non-null   object
 83   Q6 | Reggie Jackson Bar                                                                1460 non-null   object
 84   Q6 | Rolos                                                                             1761 non-null   object
 85   Q6 | Sandwich-sized bags filled with BooBerry Crunch                                   1699 non-null   object
 86   Q6 | Skittles                                                                          1769 non-null   object
 87   Q6 | Smarties (American)                                                               1750 non-null   object
 88   Q6 | Smarties (Commonwealth)                                                           1573 non-null   object
 89   Q6 | Snickers                                                                          1785 non-null   object
 90   Q6 | Sourpatch Kids (i.e. abominations of nature)                                      1737 non-null   object
 91   Q6 | Spotted Dick                                                                      1593 non-null   object
 92   Q6 | Starburst                                                                         1782 non-null   object
 93   Q6 | Sweet Tarts                                                                       1767 non-null   object
 94   Q6 | Swedish Fish                                                                      1760 non-null   object
 95   Q6 | Sweetums (a friend to diabetes)                                                   1472 non-null   object
 96   Q6 | Take 5                                                                            1557 non-null   object
 97   Q6 | Tic Tacs                                                                          1761 non-null   object
 98   Q6 | Those odd marshmallow circus peanut things                                        1739 non-null   object
 99   Q6 | Three Musketeers                                                                  1767 non-null   object
 100  Q6 | Tolberone something or other                                                      1769 non-null   object
 101  Q6 | Trail Mix                                                                         1767 non-null   object
 102  Q6 | Twix                                                                              1785 non-null   object
 103  Q6 | Vials of pure high fructose corn syrup, for main-lining into your vein            1683 non-null   object
 104  Q6 | Vicodin                                                                           1686 non-null   object
 105  Q6 | Whatchamacallit Bars                                                              1652 non-null   object
 106  Q6 | White Bread                                                                       1718 non-null   object
 107  Q6 | Whole Wheat anything                                                              1728 non-null   object
 108  Q6 | York Peppermint Patties                                                           1770 non-null   object
 109  Q7: JOY OTHER                                                                          917 non-null    object
 110  Q8: DESPAIR OTHER                                                                      723 non-null    object
 111  Q9: OTHER COMMENTS                                                                     389 non-null    object
 112  Q10: DRESS                                                                             1730 non-null   object
 113  Unnamed: 113                                                                           9 non-null      object
 114  Q11: DAY                                                                               1750 non-null   object
 115  Q12: MEDIA [Daily Dish]                                                                85 non-null     float64
 116  Q12: MEDIA [Science]                                                                   1376 non-null   float64
 117  Q12: MEDIA [ESPN]                                                                      99 non-null     float64
 118  Q12: MEDIA [Yahoo]                                                                     67 non-null     float64
 119  Click Coordinates (x, y)                                                               1619 non-null   object
dtypes: float64(4), int64(1), object(115)
memory usage: 2.3+ MB
Of course, if you are just looking for the column names, you can use a simple for loop.
    In [23]:
# print a list of column names
for col in candy.columns:
    print(col)
Internal ID
    Q1: GOING OUT?
    Q2: GENDER
    Q3: AGE
    Q4: COUNTRY
    Q5: STATE, PROVINCE, COUNTY, ETC
    Q6 | 100 Grand Bar
    Q6 | Anonymous brown globs that come in black and orange wrappers (a.k.a. Mary Janes)
    Q6 | Any full-sized candy bar
    Q6 | Black Jacks
    Q6 | Bonkers (the candy)
    Q6 | Bonkers (the board game)
    Q6 | Bottle Caps
    Q6 | Box’o’Raisins
    Q6 | Broken glow stick
    Q6 | Butterfinger
    Q6 | Cadbury Creme Eggs
    Q6 | Candy Corn
    Q6 | Candy that is clearly just the stuff given out for free at restaurants
    Q6 | Caramellos
    Q6 | Cash, or other forms of legal tender
    Q6 | Chardonnay
    Q6 | Chick-o-Sticks (we donÕt know what that is)
    Q6 | Chiclets
    Q6 | Coffee Crisp
    Q6 | Creepy Religious comics/Chick Tracts
    Q6 | Dental paraphenalia
    Q6 | Dots
    Q6 | Dove Bars
    Q6 | Fuzzy Peaches
    Q6 | Generic Brand Acetaminophen
    Q6 | Glow sticks
    Q6 | Goo Goo Clusters
    Q6 | Good N’ Plenty
    Q6 | Gum from baseball cards
    Q6 | Gummy Bears straight up
    Q6 | Hard Candy
    Q6 | Healthy Fruit
    Q6 | Heath Bar
    Q6 | Hershey’s Dark Chocolate
    Q6 | HersheyÕs Milk Chocolate
    Q6 | Hershey’s Kisses
    Q6 | Hugs (actual physical hugs)
    Q6 | Jolly Rancher (bad flavor)
    Q6 | Jolly Ranchers (good flavor)
    Q6 | JoyJoy (Mit Iodine!)
    Q6 | Junior Mints
    Q6 | Senior Mints
    Q6 | Kale smoothie
    Q6 | Kinder Happy Hippo
    Q6 | Kit Kat
    Q6 | LaffyTaffy
    Q6 | LemonHeads
    Q6 | Licorice (not black)
    Q6 | Licorice (yes black)
    Q6 | Lindt Truffle
    Q6 | Lollipops
    Q6 | Mars
    Q6 | Maynards
    Q6 | Mike and Ike
    Q6 | Milk Duds
    Q6 | Milky Way
    Q6 | Regular M&Ms
    Q6 | Peanut M&MÕs
    Q6 | Blue M&M’s
    Q6 | Red M&M’s
    Q6 | Green Party M&M’s
    Q6 | Independent M&M’s
    Q6 | Abstained from M&M’ing.
    Q6 | Minibags of chips
    Q6 | Mint Kisses
    Q6 | Mint Juleps
    Q6 | Mr. Goodbar
    Q6 | Necco Wafers
    Q6 | Nerds
    Q6 | Nestle Crunch
    Q6 | Now’n’Laters
    Q6 | Peeps
    Q6 | Pencils
    Q6 | Pixy Stix
    Q6 | Real Housewives of Orange County Season 9 Blue-Ray
    Q6 | ReeseÕs Peanut Butter Cups
    Q6 | Reese’s Pieces
    Q6 | Reggie Jackson Bar
    Q6 | Rolos
    Q6 | Sandwich-sized bags filled with BooBerry Crunch
    Q6 | Skittles
    Q6 | Smarties (American)
    Q6 | Smarties (Commonwealth)
    Q6 | Snickers
    Q6 | Sourpatch Kids (i.e. abominations of nature)
    Q6 | Spotted Dick
    Q6 | Starburst
    Q6 | Sweet Tarts
    Q6 | Swedish Fish
    Q6 | Sweetums (a friend to diabetes)
    Q6 | Take 5
    Q6 | Tic Tacs
    Q6 | Those odd marshmallow circus peanut things
    Q6 | Three Musketeers
    Q6 | Tolberone something or other
    Q6 | Trail Mix
    Q6 | Twix
    Q6 | Vials of pure high fructose corn syrup, for main-lining into your vein
    Q6 | Vicodin
    Q6 | Whatchamacallit Bars
    Q6 | White Bread
    Q6 | Whole Wheat anything
    Q6 | York Peppermint Patties
    Q7: JOY OTHER
    Q8: DESPAIR OTHER
    Q9: OTHER COMMENTS
    Q10: DRESS
    Unnamed: 113
    Q11: DAY
    Q12: MEDIA [Daily Dish]
    Q12: MEDIA [Science]
    Q12: MEDIA [ESPN]
    Q12: MEDIA [Yahoo]
    Click Coordinates (x, y)
    This data set is pretty messy. Your goal is now to perform the following actions to get it to the point where it can be passed to a machine learning model.
    Note: Unless the instructions ask you to do something different, please always update the original candy DataFrame for the exercises below. The automatic grading in CodeGrade will check your final DataFrame and ensure that you have performed all required data manipulations. Also, feel free to add additional cells as needed.
    Data Cleaning
Exercise1: Taking a look at the column names, you may notice that some include the character Õ. This should instead be an apostrophe ('). Rename the column names that include the Õ character, replacing it with an apostrophe.
Remember that you should be updating the candy DataFrame for the tasks listed unless told otherwise.
    In [24]:
# replace the Õ character in the column names with an apostrophe
candy.columns = candy.columns.str.replace('Õ', "'")
Q1: How many duplicated rows are there in the file? Assume that a duplicate is any row that is exactly the same as another one. Save this number as Q1.
    In [25]:
Q1 = candy.duplicated(keep=False).sum()
Q1
Out[25]:
34
Q2: How many duplicated rows are there in the file if we were to assume that a duplicate is any row with the same Internal ID number as another? In other words, even if the other values are different, a row would count as a duplicate if it had the same Internal ID as another. Save this number as Q2.
    In [26]:
Q2 = candy['Internal ID'].duplicated(keep=False).sum()
Q2
Out[26]:
38
Exercise2: Drop any duplicates from the candy DataFrame. Duplicates are to be defined as any row with the same Internal ID as another. Use the default setting that keeps the first record from the duplicates.
    In [27]:
candy = candy.drop_duplicates(subset='Internal ID')
candy
Out[27]:
[DataFrame output truncated: the first and last five rows of the deduplicated data]
    2460 rows × 120 columns
    Exercise3: Your next task is to remove the following columns from the candy DataFrame as we will not use these columns for this project. You are welcome to do further analysis on these columns but do not save your analysis in this notebook.
    Remove the following columns: Internal ID, Q5: STATE, PROVINCE, COUNTY, ETC, Q7: JOY OTHER, Q8: DESPAIR OTHER, Q9: OTHER COMMENTS, Unnamed: 113, Click Coordinates (x, y).
    In [28]:
candy_columns_remove = ['Internal ID', 'Q5: STATE, PROVINCE, COUNTY, ETC', 'Q7: JOY OTHER',
                        'Q8: DESPAIR OTHER', 'Q9: OTHER COMMENTS', 'Unnamed: 113',
                        'Click Coordinates (x, y)']
candy.drop(columns=candy_columns_remove, inplace=True)
Code Check: As a check for the above exercises, the shape of your data should now be: (2460, 113)
    In [30]:
candy.shape
Out[30]:
(2460, 113)
Exercise4: Let's now take a look at the Q2: GENDER column, since this will be what we are trying to predict. Take a look at the value counts for this column.
    In [31]:
# note: do not assign this back to candy, or you will overwrite the DataFrame with a Series
candy['Q2: GENDER'].value_counts()
Out[31]:
    Q2: GENDER
    Male 1466
    Female 839
    I’d rather not say 83
    Other 30
Name: count, dtype: int64
Q3: How many missing values are in the Q2: GENDER column? Save this as Q3.
    In [37]:
Q3 = candy['Q2: GENDER'].isnull().sum()
Q3
Out[37]:
42
    Exercise5: Using the candy DataFrame, remove all rows with a missing value in the Q2: GENDER column. (This should overwrite and be saved as candy like you have been doing for the previous exercises.)
    In [ ]:
### ENTER CODE HERE ###
Exercise6: For this project, we want to use binary classification, which predicts one of two classes. We want to predict between Male and Female. Because of this, select only the rows that contain either Male or Female in the Q2: GENDER column.
    In [ ]:
### ENTER CODE HERE ###
Code Check: As a check for the above exercises, the shape of your data should now be: (2305, 113)
    In [ ]:
### ENTER CODE HERE ###
Now, let's work on filling in some of the missing data. There are easier ways to do this with the sklearn library, which you will learn more about in the machine learning classes, but for now, let's practice our Pandas skills.
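As a warm-up for the exercises below, here is a minimal sketch of counting and filling missing values with fillna(). The column names and values here are made up for illustration; they are not the assignment answer.

```python
import pandas as pd
import numpy as np

# small hypothetical survey frame with gaps
toy = pd.DataFrame({
    'going_out': ['Yes', np.nan, 'No', np.nan],
    'candy_a': ['JOY', np.nan, 'MEH', 'DESPAIR'],
})

# count the gaps, then fill each column with a chosen default
missing = toy['going_out'].isnull().sum()         # 2 missing answers
toy['going_out'] = toy['going_out'].fillna('No')  # treat non-response as not going out
toy['candy_a'] = toy['candy_a'].fillna('NO_ANSWER')

print(missing)                                    # 2
print(toy['going_out'].value_counts().to_dict())  # {'No': 3, 'Yes': 1}
```

The same isnull().sum() / fillna() pattern scales to the real candy columns.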
    Q4: How many missing values are in the Q1: GOING OUT? column? Save this number as Q4.
    In [ ]:
### ENTER CODE HERE ###
Exercise7: For a future analysis question, we are interested in those that we know will definitely go out for Halloween. Because of this, fill all missing values in the Q1: GOING OUT? column with a No value.
    In [ ]:
### ENTER CODE HERE ###
Code Check: Double check your above work and look at the value counts for the Q1: GOING OUT? column. Make sure that you only have "Yes" and "No" values and that they add up to 2305, which is the number of rows you should have at this step in the assignment.
    In [ ]:
### ENTER CODE HERE ###
Exercise8: To get ready for the next step, let's practice selecting a range of columns, going from Q6 | 100 Grand Bar to Q11: DAY. Save this slice as candy_slice.
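If you want a feel for label-based column slicing first, here is a toy sketch with made-up column names (not the assignment data). Note that .loc slicing is inclusive of both endpoints, unlike positional slicing.

```python
import pandas as pd

# toy frame with made-up column names
toy = pd.DataFrame([[1, 2, 3, 4]], columns=['a', 'b', 'c', 'd'])

# label-based slicing with .loc includes BOTH endpoints
toy_slice = toy.loc[:, 'b':'c']
print(list(toy_slice.columns))  # ['b', 'c']
```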
    In [ ]:
### ENTER CODE HERE ###
Exercise9: Now that you know how to slice the data, fill any missing values in the candy DataFrame for those columns (going from Q6 | 100 Grand Bar to Q11: DAY) with the string NO_ANSWER. Make sure you are working with the candy DataFrame and not the candy_slice DataFrame.
    In [ ]:
### ENTER CODE HERE ###
Exercise10: For all four Q12: MEDIA columns in the candy DataFrame, fill the missing values with 0.0.
    In [ ]:
### ENTER CODE HERE ###
Code Check: As a check for the above code, make sure that there are no missing values left for the Q6 to Q12 columns.
    In [ ]:
### ENTER CODE HERE ###
Now, let's look at the very messy Q4: COUNTRY column and see what we can do about it. First, run the code below to look at the different unique values in the data.
    In [ ]:
# check unique values
candy['Q4: COUNTRY'].unique()
Code Check: As a check for the Country column, check how many unique values are in the data. You should have 115 different unique values for the Q4: COUNTRY column. If you have more or fewer than this number, double check your work above.
    In [ ]:
# check the number of unique values in Q4: COUNTRY
candy['Q4: COUNTRY'].nunique()
We want to clean up this data to only include four areas: USA, Canada, Europe (the continent, not necessarily the European Union), and Other.
    There are different ways to do this, but I would suggest that you look at the way we handled the property_type column in the vienna data set and the code in the amenities_to_columns() function in the module notebook. These might be a little harder than those examples but they should give you a good baseline approach.
    You could use replace() for this step, and it is fine if you ultimately decide to do this, but I would suggest that you come up with a solution similar to what was shown in the vienna data cleaning notebook. This method would be much more robust if you had many more values in your data.
    I suggest the following order for this section to make it easier:
    Fill in all missing values with Other
Code Australia as Other (doing this step first helps because "australia" contains the substring "us", which can cause problems in the next step if you use string methods)
    Combine all USA entries together as USA
    Combine Canadian entries as CA
    Combine European entries as EU
    Everything else gets coded as Other
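The mask-based approach those steps describe can be sketched on made-up values (these are not the real survey entries, and the matching patterns are illustrative, not the graded answer): normalize the text once, then assign labels through boolean masks instead of listing every raw spelling in replace().

```python
import pandas as pd

# hypothetical raw answers
country = pd.Series(['USA ', 'usa', 'United States', 'Canada`', 'canada ', 'UK', None])

# 1) fill gaps first so the string methods below never hit NaN
country = country.fillna('Other')

# 2) normalize once, then build boolean masks on the cleaned text
#    (in the real data you would code Australia to Other before this,
#     since 'australia' also contains the substring 'us')
clean = country.str.strip().str.lower()
country[clean.str.contains('us|states|america')] = 'USA'
country[clean.str.startswith('can')] = 'CA'
country[clean.isin(['uk', 'united kingdom', 'england'])] = 'EU'

print(sorted(country.unique()))  # ['CA', 'EU', 'Other', 'USA']
```

Because the masks work on normalized text, this style keeps working even if new misspellings show up in the data.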
    Exercise11: Fill the missing values in the Q4: COUNTRY column with Other.
    In [ ]:
### ENTER CODE HERE ###
Code Check: Double check that there are no missing values in the Q4: COUNTRY column. Also, double check the unique values to make sure that "Other" was added. This should mean that you now have 116 unique values for this column.
    In [ ]:
# check missing Q4 values
### ENTER CODE HERE ###
In [ ]:
# check unique values
### ENTER CODE HERE ###
Exercise12: Combine all Australia entries into Other. Watch out for capitalization issues. You should have 114 unique values after this step.
    In [ ]:
### ENTER CODE HERE ###
In [ ]:
# check number of unique values
### ENTER CODE HERE ###
Exercise13: Combine all United States entries together into USA. These would include the following:
    'USA ', 'USA', 'us', 'usa', 'Us', 'US', 'Murica', 'United States', 'united states', 'Usa', 'United States ', 'United staes', 'United States of America', 'United states', 'u.s.a.', 'United States of America ', 'america', 'U.S.A.', 'unhinged states', 'united states of america', 'US of A', 'The United States', 'North Carolina ', 'Unied States', 'U S', 'u.s.', 'The United States of America', 'unite states','U.S.', 'USA? Hard to tell anymore..', "'merica", 'United State', 'United Sates', 'California', 'Unites States', 'USa', 'I pretend to be from Canada, but I am really from the United States.', 'Usa ', 'United Stated', 'New Jersey', 'United ststes', 'America', 'United Statss', 'murrika', 'USA! USA! USA!', 'USAA', 'united States ', 'N. America', 'USSA', 'U.S. ', 'u s a', 'United Statea', 'united ststes', 'USA USA USA!!!!'
    In [ ]:
### ENTER CODE HERE ###
Code Check: You should be merging the above values together into 1 (USA) and be left with 61 unique values after this step (including the USA value).
    In [ ]:
# check unique values
### ENTER CODE HERE ###
Exercise14: Combine the Canadian entries (both upper and lower case) and label them as CA. Be careful, as there are extra spaces, characters, and misspellings (Can, Canae).
    These values include:
    'canada', 'Canada', 'canada ', 'Canada ', 'Can', 'Canae', 'Canada`', 'CANADA'
    In [ ]:
### ENTER CODE HERE ###
Code Check: You should be merging 8 values together into 1 (CA) and be left with 54 unique values after this step (including the CA value).
    In [ ]:
# check unique values
### ENTER CODE HERE ###
Exercise15: Combine the European entries and label them as EU. Again, we are looking at the continent of Europe and not necessarily the countries that are part of the European Union.
    These values include:
    'uk', 'United Kingdom', 'England', 'UK', 'france', 'finland', 'Netherlands', 'germany', 'Europe', 'U.K. ', 'Greece', 'France', 'Ireland', 'Uk', 'Germany', 'Scotland', 'UK ', 'Denmark', 'France ', 'Switzerland', 'Scotland ', 'The Netherlands', 'Ireland ', 'spain', 'Sweden', 'United kingdom'
    In [ ]:
### ENTER CODE HERE ###
Code Check: You should be merging 26 entries together and be left with 29 unique values after this step (including the EU value).
    In [ ]:
# check unique values
### ENTER CODE HERE ###
Exercise16: Finally, combine the other entries and label them as Other.
    In [ ]:
### ENTER CODE HERE ###
Code Check: Double check that you only have four unique values in the Q4: COUNTRY column: USA, Other, CA, and EU.
    In [ ]:
# check values
### ENTER CODE HERE ###
Q5: To double check that everything was coded correctly, save the value counts of the Q4: COUNTRY column as Q5. You can check this once you run your CodeGrade check.
    In [ ]:
### ENTER CODE HERE ###
We now want to look at the Q3: AGE column. Let's look at all the unique values.
    In [ ]:
# check unique age values
### ENTER CODE HERE ###
Again, this is a pretty messy column of data. This is a good example of why those who create online surveys shouldn't allow respondents to put just any value into a field. But it is now our job to clean this up.
    Exercise17: Your task is to put these values into the following categorical bins: unknown, 17 and under, 18-25, 26-35, 36-45, 46-55, and 56+.
    The category labels should exactly match the above.
    Missing values should be replaced with the unknown category
    To make things easier and avoid ambiguity, let’s say that any value with text, even if we could determine the age, will be binned with the unknown category. For example: sixty-nine should be coded as unknown, 45-55 should be coded as unknown, 59 on the day after Halloween should be coded as unknown, etc.
Ensure that the category labels are unordered, but reorder the categories so that unknown is listed in the first position. This is not strictly needed but will help us grade your assignment. The categories should be listed as follows: Index(['unknown', '17 and under', '18-25', '26-35', '36-45', '46-55', '56+'], dtype='object')
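The binning mechanics can be sketched on made-up ages (a minimal illustration, not the assignment answer): pd.cut assigns numeric values to labeled intervals, and the .cat accessor manages the category list.

```python
import pandas as pd
import numpy as np

# made-up ages after the non-numeric values have been set to NaN
ages = pd.Series([12.0, 19.0, 30.0, 44.0, 50.0, 70.0, np.nan])

# right-inclusive bins: (0, 17], (17, 25], (25, 35], (35, 45], (45, 55], (55, inf)
labels = ['17 and under', '18-25', '26-35', '36-45', '46-55', '56+']
binned = pd.cut(ages, bins=[0, 17, 25, 35, 45, 55, np.inf],
                labels=labels, ordered=False)

# add an 'unknown' category, use it for the NaNs, then move it to the front
binned = binned.cat.add_categories(['unknown']).fillna('unknown')
binned = binned.cat.reorder_categories(['unknown'] + labels)
print(list(binned.cat.categories))
```

Because the bins are right-inclusive, an age of exactly 17 lands in "17 and under" and 25 lands in "18-25".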
    First, we will replace any non-numeric value (those with text as mentioned above) with a missing value. This will allow you to turn the other values into floats so that you can bin them. Just don’t forget to code the missing values as unknown when you are done. To replace the non-numeric values, run the following code:
    In [ ]:
# create True/False index
age_index = candy['Q3: AGE'].str.isnumeric()
# for the index, fill missing values with False
age_index = age_index.fillna(False)
# select the Age column for only the False values from the index and code them as missing
candy.loc[~age_index, 'Q3: AGE'] = np.nan
In [ ]:
### ENTER REST OF CODE HERE ###
Exercise18: Double check yourself by checking the categories for the Q3: AGE column. It should output: Index(['unknown', '17 and under', '18-25', '26-35', '36-45', '46-55', '56+'], dtype='object')
    In [ ]:
# double check categories
### ENTER CODE HERE ###
Code Check: To double check that your above binning worked correctly, your value counts (sorted by the index) should be as follows:
    unknown: 60
    17 and under: 49
    18-25: 85
    26-35: 520
    36-45: 768
    46-55: 525
    56+: 298
    In [ ]:
### ENTER CODE HERE ###
You can also double check some of your work up to this point by making sure that there are no missing values in the data set anymore.
    Code Check: Check to see if there are any missing values in the data set. Your output should show 0.
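One common idiom for this check is to chain two sums, which totals the missing values across every column; a sketch on a toy frame:

```python
import pandas as pd

# toy frame with no missing values
df = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']})

# isna() marks missing cells, the first sum() totals per column,
# the second sum() totals across all columns
print(df.isna().sum().sum())  # 0 when nothing is missing
```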
    In [ ]:
### ENTER CODE HERE ###
Exercise 19: Before you move on to the next section, reset the index for candy, ensuring that it goes from 0 to n-1.
    In [ ]:
### ENTER CODE HERE ###
Feature Engineering
Feature engineering is the process of transforming raw data into features that better represent the underlying problem to predictive models. In this section, we will create a new column called "net_feelies" (calculated by the survey authors as the total JOY count minus the total DESPAIR count).
    First, let’s narrow down our data to make working with it easier.
    Exercise20: Select only the Q6 candy columns (Q6 | 100 Grand Bar through Q6 | York Peppermint Patties) in the data set and save this as a new DataFrame called candy_reduced.
    In [ ]:
### ENTER CODE HERE ###
Next, we will create two Series, one with JOY counts and one with DESPAIR counts, to add to our candy_reduced data.
Exercise 21: Create a Series called joy_count that lists the total counts of JOY for each column, keeping it in the same order as the columns in the candy_reduced DataFrame. Hint: A simple way to do this is to filter the entire DataFrame for JOY values and then use count(). See this stackoverflow question and answers.
    In [ ]:
### ENTER CODE HERE ###
Exercise 22: Same as above, except you will create a Series called despair_count that lists the total counts of DESPAIR for each column.
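The filter-then-count pattern from the hint can be sketched on a made-up two-column frame (the column names are illustrative, not the real Q6 columns):

```python
import pandas as pd

# toy stand-in for two Q6 candy columns
toy = pd.DataFrame({
    'Q6 | Candy A': ['JOY', 'DESPAIR', 'JOY', None],
    'Q6 | Candy B': ['MEH', 'JOY', 'DESPAIR', 'DESPAIR'],
})

# filtering keeps matching cells and turns everything else into NaN;
# count() then tallies the non-missing cells per column
joy_count = toy[toy == 'JOY'].count()
despair_count = toy[toy == 'DESPAIR'].count()
print(joy_count.tolist())      # [2, 1]
print(despair_count.tolist())  # [1, 2]
```

An equivalent one-liner is (toy == 'JOY').sum(), which sums the True values of the boolean comparison directly.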
    In [ ]:
### ENTER CODE HERE ###
Exercise 23: Take the transpose of the candy_reduced DataFrame and save the transposed data as candy_reduced_transpose.
    In [ ]:
### ENTER CODE HERE ###
Exercise 24: Add a new column called "joy_count" using the joy_count Series above, and a new column called "despair_count" using the despair_count Series above, to the candy_reduced_transpose DataFrame.
    In [ ]:
### ENTER CODE HERE ###
Exercise 25: Add a new column to the candy_reduced_transpose DataFrame called "net_feelies" that takes the joy_count column and subtracts the despair_count column.
    In [ ]:
### ENTER CODE HERE ###
Exercise 26: Select only the joy_count, despair_count, and net_feelies columns from the candy_reduced_transpose DataFrame. Sort this DataFrame in descending order by net_feelies and save it as candy_net_sorted.
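Exercises 23 through 26 chain together; here is a minimal sketch of the whole chain on made-up data. The key point is that a Series assigned as a new column aligns by index, so the candy-name index of the count Series lines up with the row labels of the transposed frame:

```python
import pandas as pd

# toy stand-in for candy_reduced and its per-column counts
toy = pd.DataFrame({
    'Q6 | Candy A': ['JOY', 'JOY', 'DESPAIR'],
    'Q6 | Candy B': ['DESPAIR', 'DESPAIR', 'JOY'],
})
joy_count = toy[toy == 'JOY'].count()
despair_count = toy[toy == 'DESPAIR'].count()

# transpose so candies become rows, then attach the counts as columns;
# the Series index (candy names) aligns with the transposed row labels
toy_t = toy.T
toy_t['joy_count'] = joy_count
toy_t['despair_count'] = despair_count
toy_t['net_feelies'] = toy_t['joy_count'] - toy_t['despair_count']

# keep only the count columns, sorted by net_feelies, largest first
net_sorted = toy_t[['joy_count', 'despair_count', 'net_feelies']].sort_values(
    'net_feelies', ascending=False)
print(net_sorted['net_feelies'].tolist())  # [1, -1]
```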
    In [ ]:
### ENTER CODE HERE ###
Encoding
We now want to get the candy DataFrame ready to run a machine learning algorithm that determines whether we can predict a person's gender based on the candy they prefer.
    You will learn more about this in the machine learning classes, but some algorithms work exclusively with numeric values. We will now turn all of our values into numeric values. There are easier ways to do this with sklearn, which you will study in later courses, but we will use Pandas to perform these exercises for further practice.
Exercise 27: For grading purposes, we want to leave the candy DataFrame as is. Make a copy of the candy DataFrame and save this new DataFrame as candy_encode.
    In [ ]:
### ENTER CODE HERE ###
Exercise 28: For the candy_encode DataFrame, replace any Female values with 0 and any Male values with 1.
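One way to do this kind of substitution is replace() with a mapping dict; a sketch on a made-up gender Series (the real column is Q2: GENDER in candy_encode):

```python
import pandas as pd

# toy gender column; replace() maps the two labels to numbers and
# leaves any other value untouched
gender = pd.Series(['Female', 'Male', 'Male', 'Female'])
encoded = gender.replace({'Female': 0, 'Male': 1})
print(encoded.tolist())  # [0, 1, 1, 0]
```

Calling replace() on the whole DataFrame with the same dict works too, since only cells that match a key are changed.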
    In [ ]:
### ENTER CODE HERE ###
Exercise 29: Again, you will learn more about this later, but we need to separate the column we want to predict (called the response) from the columns we will use to make the predictions (called the features). For both of the items below, make sure that the index is reset and goes from 0 to n-1.
    Select only the Q2: GENDER column from candy_encode and save this as candy_response. Note: This should be a Series.
Drop the Q2: GENDER column from candy_encode and save the remaining columns as your features DataFrame.
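A sketch of the response/features split on a toy frame; the variable names response and features are illustrative stand-ins for whatever names the assignment asks for:

```python
import pandas as pd

# toy stand-in for candy_encode after encoding; a non-default
# index shows why the reset matters
toy = pd.DataFrame({'Q2: GENDER': [0, 1, 1],
                    'Q6 | Candy A': ['JOY', 'MEH', 'DESPAIR']},
                   index=[5, 7, 9])

# response: just the column to predict, as a Series with a fresh 0..n-1 index
response = toy['Q2: GENDER'].reset_index(drop=True)

# features: everything else, with the same fresh index
features = toy.drop(columns='Q2: GENDER').reset_index(drop=True)
print(response.index.tolist())  # [0, 1, 2]
```

drop=True discards the old index instead of inserting it back as a column, which keeps the features frame purely numeric-ready.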