cheeseDM/cheese.ipynb


{
"cells": [
{
"cell_type": "markdown",
"id": "5f7c9658-c285-4854-96c0-e899fc55421b",
"metadata": {},
"source": [
"# DM project: cheese"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7f4f2b89-8257-468c-9f5e-a77e11b8b8ff",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import sys\n",
"import time\n",
"import json\n",
"import random\n",
"\n",
"from matplotlib import colors\n",
"import matplotlib.pyplot as plt\n",
"import plotly.express as px\n",
"import tqdm.notebook as tqdm\n",
"from IPython.display import display, HTML\n",
"\n",
"from geopy.geocoders import Nominatim\n",
"\n",
"\n",
"import pandas as pd\n",
"\n",
"from mlxtend.frequent_patterns import apriori, association_rules\n",
"\n",
"from sklearn import tree\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.preprocessing import LabelEncoder\n",
"from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet"
]
},
{
"cell_type": "markdown",
"id": "ceb71784-b0bf-4015-b8e6-78007c368e49",
"metadata": {},
"source": [
"For this project, we chose to study cheeses. We retrieved a [dataset from Kaggle](https://www.kaggle.com/datasets/joebeachcapital/cheese) that lists several characteristics for more than $1000$ cheeses: origin, milk, type, texture, rind, flavor, and so on. "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1a0afba8-692b-4377-a2ce-5114983e3bbb",
"metadata": {},
"outputs": [],
"source": [
"data = pd.read_csv(\"cheeses.csv\")\n",
"data"
]
},
{
"cell_type": "markdown",
"id": "bf3b548c-5ac4-4126-9ae9-5578ad158015",
"metadata": {},
"source": [
"## I. Cleaning and pre-processing"
]
},
{
"cell_type": "markdown",
"id": "44e3c761-c985-451b-9a02-22f37918b5a9",
"metadata": {},
"source": [
"We carried out several tasks during cleaning and preprocessing: \n",
"- Dropping rows that did not have enough data\n",
"- Dropping columns with too many ```NaN``` values (e.g. ```fat_content```)\n",
"- Converting locations (given as region and country names) to GPS coordinates, to use them in linear regression\n",
"- Converting cheese colors to RGB values in the same way\n",
"- Splitting characteristics given as lists of adjectives into separate boolean columns, to ease processing\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5d76fde3-8c65-4b50-a097-6dd81a68c1ca",
"metadata": {},
"outputs": [],
"source": [
"data.describe().T.plot(kind='bar')"
]
},
{
"cell_type": "markdown",
"id": "4590cffd-d4a9-4e15-8fd5-cbb22f048300",
"metadata": {},
"source": [
"Since the `calcium_content` and `fat_content` columns have too many null values, we chose to remove them. \n",
"Similarly, we removed other columns we are not interested in: "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c8489ffa-1067-4eb7-b65a-2fa18fdb4b04",
"metadata": {},
"outputs": [],
"source": [
"unused_columns = [\"fat_content\", \"calcium_content\", \"alt_spellings\", \"producers\", \"url\", \"synonyms\"]\n",
"for col in unused_columns:\n",
"    if col in data.columns:\n",
"        del data[col]\n",
"data"
]
},
{
"cell_type": "markdown",
"id": "d42869b5-a4ea-4cd6-bd0e-1532af90f2da",
"metadata": {},
"source": [
"### Converting the locations to GPS coordinates\n",
"\n"
]
},
{
"cell_type": "markdown",
"id": "74044e9b-6ce4-420f-b1ad-492a4362ffb4",
"metadata": {},
"source": [
"Now, we are interested in having only one column representing the location for each cheese. "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "633ed80e-e416-41f6-ae58-b86ce4c132af",
"metadata": {},
"outputs": [],
"source": [
"data = data.dropna(subset=[\"country\", \"region\"], how=\"all\")\n",
"data = data.fillna(value={\"country\": \"\", \"region\": \"\"})\n",
"print(f\"{len(data)} rows remaining\")\n",
"data"
]
},
{
"cell_type": "markdown",
"id": "fd66568f-78d4-4e1a-a91c-8ec483b4b03c",
"metadata": {},
"source": [
"We removed 6 rows for which we could not find a suitable location. "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7ef7494b-ff08-40a5-890f-e0f718cf2842",
"metadata": {},
"outputs": [],
"source": [
"data.loc[data.country.str.contains(\"England, Great Britain, United Kingdom\")|data.country.str.contains(\"England, United Kingdom\"),\"country\"]=\"England\"\n",
"data.loc[data.country.str.contains(\"Scotland\"),\"country\"]=\"Scotland\"\n",
"data.loc[data.country.str.contains(\"Great Britain, United Kingdom, Wales\")|data.country.str.contains(\"United Kingdom, Wales\"),\"country\"]=\"Wales\""
]
},
{
"cell_type": "markdown",
"id": "c479661d-4019-4557-8c53-d4223f0f246c",
"metadata": {},
"source": [
"We normalized some country names so that the geocoder can resolve the locations more easily. "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fb044984-c33c-492c-91a2-4e9fff29ceb3",
"metadata": {},
"outputs": [],
"source": [
"data = data.drop(index=data[data[\"country\"].str.contains(\",\")].index)\n",
"data = data.drop(index=data[data[\"country\"].str.contains(\" and \")].index)\n",
"data = data.reset_index(drop=True)  # reset_index returns a new DataFrame, so assign it back\n",
"data"
]
},
{
"cell_type": "markdown",
"id": "2f42c973-247a-4f51-947e-fbd76f8f12fc",
"metadata": {},
"source": [
"We removed 41 cheeses because they can come from several countries. "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "59c4e6e7-d624-45a5-a9ea-eb375102b771",
"metadata": {},
"outputs": [],
"source": [
"data[\"location\"]=data[\"region\"]+\", \"+data[\"country\"]"
]
},
{
"cell_type": "markdown",
"id": "d52eb9d7-7ce9-4ddb-81f8-b251c9754b87",
"metadata": {},
"source": [
"In order to have more numeric data for the classification and regression algorithms, we transform the locations to GPS coordinates and the colors to RGB. "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "debb780e-ec13-4502-ac44-6001335e507d",
"metadata": {},
"outputs": [],
"source": [
"def str_to_gps(loc):\n",
"    l = loc.split(\",\")\n",
"    # Keeping only the first and last parts (region, country) causes fewer geocoding errors. \n",
"    loc = \",\".join([l[0], l[-1]])\n",
"    # In the real world, we would use a (non-free) commercial geocoding API for this. \n",
"    # Nominatim is free and gives reasonably good results. \n",
"    try:\n",
"        res = Nominatim(user_agent=\"dmProject\").geocode(loc)\n",
"        return (res.latitude, res.longitude)\n",
"    except AttributeError:  # geocode() returned None: fall back to the country alone\n",
"        loc = l[-1]\n",
"        res = Nominatim(user_agent=\"dmProject\").geocode(loc)\n",
"        return (res.latitude, res.longitude)\n",
"\n",
"def get_locations(backup_file):\n",
"    # Geocodes the global `locs` set and caches the results in backup_file. \n",
"    errors = set()\n",
"    if os.path.isfile(backup_file):\n",
"        with open(backup_file) as f:\n",
"            return json.load(f)\n",
"    locations_to_gps = {}\n",
"    for loc in tqdm.tqdm(locs):\n",
"        time.sleep(1)  # do not overload the Nominatim server, or it will stop responding\n",
"        try:\n",
"            locations_to_gps[loc] = str_to_gps(loc)\n",
"            print(loc, locations_to_gps[loc])\n",
"        except AttributeError:\n",
"            errors.add(loc)\n",
"            print(loc, file=sys.stderr)\n",
"    with open(backup_file, \"w\") as f:\n",
"        json.dump(locations_to_gps, f)\n",
"    return locations_to_gps"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "204d1446-e58f-4585-8ac0-7466930e4291",
"metadata": {},
"outputs": [],
"source": [
"locs = set(data[\"location\"])\n",
"locations_to_gps = get_locations(\"locations_to_gps.json\")\n",
"latitudes, longitudes = [], []\n",
"for value in data.location:\n",
"    latitudes.append(locations_to_gps[value][0])\n",
"    longitudes.append(locations_to_gps[value][1])\n",
"data[\"latitude\"] = latitudes\n",
"data[\"longitude\"] = longitudes"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d41b1dc8-90df-44b8-9d83-d218f82a3637",
"metadata": {},
"outputs": [],
"source": [
"fig = px.scatter_map(data, \n",
" lat=\"latitude\", \n",
" lon=\"longitude\", \n",
" hover_name=\"cheese\", \n",
" hover_data=[\"cheese\"],\n",
" color=\"milk\",\n",
" zoom=1.5,\n",
" height=800,\n",
" width=1400)\n",
"\n",
"fig.update_layout(map_style=\"open-street-map\")  # px.scatter_map uses MapLibre, whose style key is map_style (not mapbox_style)\n",
"fig.update_layout(margin={\"r\": 0, \"t\": 0, \"l\": 0, \"b\": 0})\n",
"fig.show();"
]
},
{
"cell_type": "markdown",
"id": "92f7516f-e401-4e27-be68-367558671913",
"metadata": {},
"source": [
"### Converting the text data to boolean values\n",
"\n",
"We want to transform the many characteristics of the cheeses to boolean values, to be able to use them as numeric data. "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "66ce4e4a-7006-411f-abd0-ee94d7cf99b3",
"metadata": {},
"outputs": [],
"source": [
"def text_to_boolean(df, cols=None):\n",
"    if cols is None:\n",
"        cols = [\"milk\", \"color\", \"type\", \"texture\", \"flavor\", \"aroma\", \"family\", \"rind\"]\n",
"\n",
"    df = df.copy()\n",
"    attributes = set()  # all the possible attributes (some are mixed across different columns)\n",
"    for col in cols:\n",
"        for val in set(df[col]):\n",
"            if isinstance(val, float):  # skip NaN values\n",
"                continue\n",
"            attributes.update(x.strip() for x in val.split(\",\"))\n",
"    row_attrs = [set() for _ in range(len(df))]  # the attributes specific to each row\n",
"    for col in cols:\n",
"        for i, row in enumerate(df[col]):\n",
"            if not isinstance(row, float):\n",
"                row_attrs[i].update(x.strip() for x in row.split(\",\"))\n",
"    for attr in attributes:  # add one boolean column per attribute\n",
"        df[attr] = [attr in row_attrs[i] for i in range(len(df))]\n",
"    return df.drop(columns=cols)"
]
},
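{
"cell_type": "markdown",
"id": "f2a91c3e-1b7d-4a6e-9c2f-3d4b5e6f7a80",
"metadata": {},
"source": [
"As a quick sanity check (the toy data below is ours, not from the dataset), `text_to_boolean` should turn comma-separated adjective lists into one boolean column per adjective: "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a1b2c3d4-e5f6-4a7b-8c9d-0e1f2a3b4c5d",
"metadata": {},
"outputs": [],
"source": [
"toy = pd.DataFrame({\"texture\": [\"soft, creamy\", \"firm\"], \"flavor\": [\"mild\", \"sharp, tangy\"]})\n",
"text_to_boolean(toy, cols=[\"texture\", \"flavor\"])  # boolean columns: soft, creamy, firm, mild, sharp, tangy"
]
},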
{
"cell_type": "markdown",
"id": "d1eb67d7-d16b-4b93-8486-582830ac3903",
"metadata": {},
"source": [
"Similarly, we convert the colors to their RGB representations. "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b8073de1-eecb-4c5e-8636-a4c76cace706",
"metadata": {},
"outputs": [],
"source": [
"def color_columns(data):\n",
"    \"\"\"\n",
"    Return 3 columns with approximate RGB values of the colors of the cheeses. \n",
"    \"\"\"\n",
"    color_to_hex = {\n",
"        'blue': \"#50564B\",\n",
"        'blue-grey': \"#504E4A\",\n",
"        'brown': \"#D19651\",\n",
"        'brownish yellow': \"#E5CD80\",\n",
"        'cream': \"#D9D9CE\",\n",
"        'golden orange': \"#D0915D\",\n",
"        'golden yellow': \"#DCBE9A\",\n",
"        'green': \"#6AA57F\",\n",
"        'ivory': \"#E8C891\",\n",
"        'orange': \"#C7980D\",\n",
"        'pale white': \"#DAD5C2\",\n",
"        'pale yellow': \"#F3D7B1\",\n",
"        'pink and white': \"#C0AB94\",\n",
"        'red': \"#984F18\",\n",
"        'straw': \"#F8EAC6\",\n",
"        'white': \"#F8F8F8\",\n",
"        'yellow': \"#EBD88B\",\n",
"    }\n",
"    color_to_rgb = {color: colors.to_rgb(code) for color, code in color_to_hex.items()}\n",
"    data_colors = [color_to_rgb.get(color, (0, 0, 0)) for color in data[\"color\"]]\n",
"    return ([c[0] for c in data_colors], [c[1] for c in data_colors], [c[2] for c in data_colors])"
]
},
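{
"cell_type": "markdown",
"id": "0b9c8d7e-6f5a-4b3c-9d2e-1f0a9b8c7d6e",
"metadata": {},
"source": [
"A tiny check (again on toy data of ours): known color names map to their RGB triples, while unknown names fall back to black, i.e. $(0, 0, 0)$: "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9e8d7c6b-5a4f-4e3d-8c2b-1a0f9e8d7c6b",
"metadata": {},
"outputs": [],
"source": [
"toy = pd.DataFrame({\"color\": [\"white\", \"mystery\"]})\n",
"color_columns(toy)  # the first entry of each returned list comes from white's hex code, the second is 0"
]
},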
{
"cell_type": "code",
"execution_count": null,
"id": "ea474256-3974-47a8-b8c8-ee252236a6c8",
"metadata": {},
"outputs": [],
"source": [
"data[\"color_r\"], data[\"color_g\"], data[\"color_b\"] = color_columns(data)\n",
"data_features=text_to_boolean(data)\n",
"data_features"
]
},
{
"cell_type": "markdown",
"id": "a1b022a3-a2f9-4e39-9e79-48ae9f6adca5",
"metadata": {},
"source": [
"## II. Classification"
]
},
{
"cell_type": "markdown",
"id": "979b9eef-9ca2-4299-a4e0-e8d3813f45c6",
"metadata": {},
"source": [
"In this part, we did two things for the classification: we built a decision tree on the dataset and, given a cheese and its characteristics, tried to find where it originates from. \n",
"\n"
]
},
{
"cell_type": "markdown",
"id": "da7e65cd-5324-496b-affd-246ae4cf9813",
"metadata": {},
"source": [
"### II.A Decision tree"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a8d0848f-b844-4a08-976d-4d1370070f73",
"metadata": {},
"outputs": [],
"source": [
"Y = LabelEncoder().fit_transform(data_features[\"country\"])\n",
"X = data_features.drop(columns=[\"cheese\", \"country\", \"region\", \"vegetarian\", \"location\", \"latitude\", \"longitude\"])\n",
"data_train, data_test, target_train, target_test = train_test_split(X, Y, random_state=0)\n",
"c = tree.DecisionTreeClassifier(max_depth=4, random_state=0)\n",
"c = c.fit(data_train, target_train)\n",
"plt.figure(figsize=(150, 100))\n",
"ax = plt.subplot()\n",
"\n",
"tree.plot_tree(c, ax=ax, filled=True, feature_names=X.columns);"
]
},
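{
"cell_type": "markdown",
"id": "7d6e5f4a-3b2c-4d1e-9f0a-8b7c6d5e4f3a",
"metadata": {},
"source": [
"As an extra check (not part of the original analysis), we can score the fitted tree on the held-out split: "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6c5b4a39-2817-4f6e-8d5c-4b3a29180f7e",
"metadata": {},
"outputs": [],
"source": [
"print(\"train accuracy:\", c.score(data_train, target_train))\n",
"print(\"test accuracy:\", c.score(data_test, target_test))"
]
},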
{
"cell_type": "markdown",
"id": "fca7080e-cb7b-4030-bafd-9036ecdb15ab",
"metadata": {},
"source": [
"We built a decision tree for our cheese database. \n",
"We noticed that the most relevant features, those used by the decision tree, concern the texture and taste of the cheeses (rindless, bloomy, soft, tangy) rather than the animal milk used. \n"
]
},
{
"cell_type": "markdown",
"id": "30bf1cd5-9b95-4300-a172-f36d870c49f6",
"metadata": {},
"source": [
"### Linear regression: find location depending on the cheese characteristics\n",
"\n",
"We try a linear regression over the data to see whether, given a cheese, we can guess where it originates from. We will see that it does not work well: each regression model has an $R^2$ coefficient below $0.3$, which is very bad. \n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "73488360-5ba3-4361-aa1a-7c8764b14acd",
"metadata": {},
"outputs": [],
"source": [
"for col in [\"cheese\", \"country\", \"region\", \"location\", \"vegetarian\", \"vegan\"]:\n",
"    try:\n",
"        del data_features[col]\n",
"    except KeyError:  # the column was already removed\n",
"        pass\n",
"data_features"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "05a6fae7-7dae-41f2-add0-86017116ea11",
"metadata": {},
"outputs": [],
"source": [
"X = data_features.drop(columns=[\"latitude\", \"longitude\"])\n",
"y = data_features[[\"longitude\", \"latitude\"]]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3a0bb4d6-dd0b-451a-b698-3cb6d0b4241d",
"metadata": {},
"outputs": [],
"source": [
"for model in LinearRegression(), Ridge(), Lasso(), ElasticNet():\n",
"    model.fit(X, y)\n",
"    print(type(model).__name__, model.score(X, y))"
]
},
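{
"cell_type": "markdown",
"id": "5b4a3928-1706-4e5d-8c4b-3a2918070e6d",
"metadata": {},
"source": [
"The scores above are computed on the training data itself. As a quick sketch, a held-out split gives a more honest (and typically even lower) estimate: "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4a392817-0695-4d4c-8b3a-291807060d5c",
"metadata": {},
"outputs": [],
"source": [
"X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)\n",
"for model in LinearRegression(), Ridge(), Lasso(), ElasticNet():\n",
"    model.fit(X_train, y_train)\n",
"    print(type(model).__name__, model.score(X_test, y_test))"
]
},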
{
"cell_type": "markdown",
"id": "731e3935-c913-4b1c-b7ca-94392d64ccca",
"metadata": {},
"source": [
"Not good, even quite bad. \n",
"\n",
"In short, it seems that we cannot find the region a cheese originates from given its characteristics. "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7fd507d0-1a68-4cd7-a12e-12c9ab1061e3",
"metadata": {},
"outputs": [],
"source": [
"model.predict(X)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9faf2aee-84f5-4633-b3de-039af42d31d3",
"metadata": {},
"outputs": [],
"source": [
"yprime = pd.DataFrame(model.predict(X), columns=[\"longitude\", \"latitude\"])  # same column order as y"
]
},
{
"cell_type": "markdown",
"id": "038cd38e-3890-4f73-91a7-c30294b3bc5b",
"metadata": {},
"source": [
"## III. Pattern Mining"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2e6b0dc1-030c-4239-803f-52736a41bcb5",
"metadata": {},
"outputs": [],
"source": [
"unused_columns = {\"vegetarian\", \"vegan\", \"cheese\", \"region\", \"color\", \"location\", \"latitude\", \"longitude\", \"country\",\"color_r\",\"color_g\",\"color_b\"}\n",
"data_features_only=data_features.drop(columns=list(unused_columns.intersection(data_features.columns)))\n",
"print(\"Number of features:\", data_features_only.shape[1])"
]
},
{
"cell_type": "markdown",
"id": "b76e8b2f-2efc-43f7-9aa7-fffb960313ad",
"metadata": {},
"source": [
"We have $164$ features in our data, which is very large compared to its number of rows. So we chose a min_support of $0.05$ for the apriori algorithm during pattern mining. "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e7113235-7546-4c71-9b34-181472466d20",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"frequent_itemsets = apriori(data_features_only,min_support=.05, use_colnames=True)\n",
"display(HTML(frequent_itemsets.to_html()))"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "61959c04-61bf-464a-89ca-72ec4782f927",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"assoc_rules = association_rules(frequent_itemsets, min_threshold=.5)\n",
"assoc_rules=assoc_rules.sort_values(by=['confidence'], ascending=False)\n",
"display(HTML(assoc_rules.to_html()))"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a3a2a838-bc56-4de8-ac5d-f1c3327f5447",
"metadata": {},
"outputs": [],
"source": [
"assoc_rules[assoc_rules[\"antecedents\"].astype(str).str.contains(\"rich\")]"
]
},
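{
"cell_type": "markdown",
"id": "392817b6-0584-4c3b-8a29-18070605c4b3",
"metadata": {},
"source": [
"Confidence alone favors rules whose consequents are frequent anyway; sorting by lift instead surfaces itemsets that co-occur far more often than chance would predict: "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2817a6b5-0473-4b2a-9918-070605b4a392",
"metadata": {},
"outputs": [],
"source": [
"assoc_rules.sort_values(by=\"lift\", ascending=False).head(10)"
]
},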
{
"cell_type": "markdown",
"id": "84e2f426-8077-46c7-bc7e-357e631972d2",
"metadata": {},
"source": [
"For pattern mining, we only kept the relevant binary attribute columns, dropping the RGB colors and any location-based information, so that only information about the cheese itself remains.\n",
"\n",
"We applied the apriori algorithm to find frequent itemsets, then searched for association rules.\n",
"\n",
"Looking at the association rules with the highest confidence, we can infer the following statement (which we then verified to be true):\n",
"- cheddar is primarily a cow-milk cheese"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "104b476d-5531-40e7-8bf6-987f00a8f5c1",
"metadata": {},
"outputs": [],
"source": [
"data_f = text_to_boolean(data)\n",
"data_f[data_f[\"bloomy\"]]"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.3"
}
},
"nbformat": 4,
"nbformat_minor": 5
}