From dcf18fa71c0da6a236b874f11706e44effc3684a Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Fran=C3=A7ois=20Colin=20de=20Verdi=C3=A8re?= Date: Tue, 1 Apr 2025 11:05:13 +0200 Subject: [PATCH] versionfinale (?) --- cheese.ipynb | 192 +++++++++++++++++++++++++++++++-------------------- 1 file changed, 116 insertions(+), 76 deletions(-) diff --git a/cheese.ipynb b/cheese.ipynb index b040323..c825c62 100644 --- a/cheese.ipynb +++ b/cheese.ipynb @@ -36,7 +36,9 @@ "from sklearn import tree\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.preprocessing import LabelEncoder\n", - "from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet" + "from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet\n", + "from sklearn.metrics import accuracy_score\n", + "from sklearn.neighbors import KNeighborsClassifier" ] }, { @@ -44,7 +46,7 @@ "id": "ceb71784-b0bf-4015-b8e6-78007c368e49", "metadata": {}, "source": [ - "For this project, we chose to study cheeses. We retrieved a [dataset from Kaggle](https://www.kaggle.com/datasets/joebeachcapital/cheese) that gives several characteristics for more than $1000$ cheeses. We have information about the origin, the milk, types, texture, rind, flavor, etc. of these cheeses. " + "For this project, we chose to study cheeses. We retrieved [the following dataset from Kaggle](https://www.kaggle.com/datasets/joebeachcapital/cheese) that gives several characteristics for more than $1000$ cheeses. We have information about the origin, the milk, types, texture, rind, flavor, etc. of these cheeses. 
" ] }, { @@ -55,6 +57,7 @@ "outputs": [], "source": [ "data = pd.read_csv(\"cheeses.csv\")\n", + "\n", "data" ] }, @@ -86,7 +89,7 @@ "metadata": {}, "outputs": [], "source": [ - "data.describe().T.plot(kind='bar')" + "data.describe().T.plot(kind='bar');" ] }, { @@ -117,7 +120,7 @@ "id": "d42869b5-a4ea-4cd6-bd0e-1532af90f2da", "metadata": {}, "source": [ - "### Converting the locations to GPS coordinates\n", + "### I.A Converting the locations to GPS coordinates\n", "\n" ] }, @@ -293,7 +296,7 @@ "id": "92f7516f-e401-4e27-be68-367558671913", "metadata": {}, "source": [ - "### Converting the text data to boolean values\n", + "### I.B Converting the text data to boolean values, and the colors to RGB\n", "\n", "We want to transform the many characteristics of the cheeses to boolean values, to be able to use them as numeric data. " ] @@ -377,14 +380,6 @@ " return list(c[0] for c in data_colors), list(c[1] for c in data_colors), list(c[2] for c in data_colors)" ] }, - { - "cell_type": "code", - "execution_count": null, - "id": "471728e0-5543-4afd-bf54-d21bd49dda75", - "metadata": {}, - "outputs": [], - "source": [] - }, { "cell_type": "code", "execution_count": null, @@ -410,44 +405,7 @@ "id": "979b9eef-9ca2-4299-a4e0-e8d3813f45c6", "metadata": {}, "source": [ - "In this part, we achieved to do two things for the classification: create a decision tree on the database and, given a cheese and its characteristics, find where it originates from. 
\n", - "\n" - ] - }, - { - "cell_type": "markdown", - "id": "da7e65cd-5324-496b-affd-246ae4cf9813", - "metadata": {}, - "source": [ - "### II.A Decision tree" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "a8d0848f-b844-4a08-976d-4d1370070f73", - "metadata": {}, - "outputs": [], - "source": [ - "Y=LabelEncoder().fit_transform(data_features[\"country\"])\n", - "X=data_features.drop(columns=[\"cheese\",\"country\",\"region\",\"vegetarian\",\"location\",\"latitude\",\"longitude\"])\n", - "data_train, data_test, target_train, target_test = train_test_split(\n", - " X, Y)\n", - "c=tree.DecisionTreeClassifier(max_depth=4,random_state=0)\n", - "c=c.fit(data_train,target_train)\n", - "plt.figure(figsize=(150,100))\n", - "ax=plt.subplot()\n", - "\n", - "tree.plot_tree(c,ax=ax,filled=True,feature_names=X.columns,);" - ] - }, - { - "cell_type": "markdown", - "id": "fca7080e-cb7b-4030-bafd-9036ecdb15ab", - "metadata": {}, - "source": [ - "We built a decision tree for our cheese database. \n", - "We noticed that the most relevant features, those used by the decision tree, focus on the texture of the cheese and the taste on the cheeses (rindless, bloomy, soft, tangy), rather than on the animal milk used. \n" + "In this part, we try to achieve the following task: given a cheese, can we find where it originates from ?" ] }, { @@ -455,9 +413,9 @@ "id": "30bf1cd5-9b95-4300-a172-f36d870c49f6", "metadata": {}, "source": [ - "### Linear regression: find location depending on the cheese characteristics\n", + "### II.A Linear regression: find location depending on the cheese characteristics\n", "\n", - "We try to do a linear regression over the data to see whether, given a cheese, we can guess where it originates from. We are going to see that it does not work very well, each regression model has a $R^2$ coefficient of less than $0.3$, which is very bad. 
\n" + "We try to do a linear regression over the data to see whether, given a cheese, we can guess where it originates from (GPS coordinates). We are going to see that it does not work very well, each regression model has a $R^2$ coefficient of less than $0.3$, which is very bad. \n" ] }, { @@ -467,6 +425,7 @@ "metadata": {}, "outputs": [], "source": [ + "old_data_features=data_features.copy()\n", "for col in [\"cheese\",\"country\",\"region\",\"location\",\"vegetarian\",\"vegan\"]:\n", " try: \n", " del data_features[col]\n", @@ -505,11 +464,9 @@ "id": "731e3935-c913-4b1c-b7ca-94392d64ccca", "metadata": {}, "source": [ - "Not good, even quite bad. \n", - "We cannot find the region a cheese originates from given its characteristic. \n", + "Those result are not good, even quite bad. We cannot find the region a cheese originates from given its characteristic. \n", "\n", - "\n", - "In short, it seems that we cannot find the region a cheese originates from given its characteristic. " + "In short, it seems that we cannot find the precise location a cheese originates from given its characteristic. 
" ] }, { @@ -532,6 +489,96 @@ "yprime=pd.DataFrame(model.predict(X),columns=[\"latitude\",\"longitude\"])" ] }, + { + "cell_type": "markdown", + "id": "da7e65cd-5324-496b-affd-246ae4cf9813", + "metadata": {}, + "source": [ + "### II.A Decision tree" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a8d0848f-b844-4a08-976d-4d1370070f73", + "metadata": {}, + "outputs": [], + "source": [ + "data_features=old_data_features.copy()\n", + "Y=LabelEncoder().fit_transform(data_features[\"country\"])\n", + "X=data_features.drop(columns=[\"cheese\",\"country\",\"region\",\"vegetarian\",\"location\",\"latitude\",\"longitude\"])\n", + "X_train, X_test, Y_train, Y_test = train_test_split(\n", + " X, Y, random_state=0,test_size=.1)\n", + "c=tree.DecisionTreeClassifier(max_depth=4,random_state=0)\n", + "c=c.fit(X_train,Y_train)\n", + "plt.figure(figsize=(150,100))\n", + "ax=plt.subplot()\n", + "\n", + "tree.plot_tree(c,ax=ax,filled=True,feature_names=X.columns,);" + ] + }, + { + "cell_type": "markdown", + "id": "fca7080e-cb7b-4030-bafd-9036ecdb15ab", + "metadata": {}, + "source": [ + "We built a decision tree for our cheese database. \n", + "We noticed that the most relevant features, those used by the decision tree, focus on the texture of the cheese and the taste on the cheeses (rindless, bloomy, soft, tangy), rather than on the animal milk used. \n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4efc8ceb-e3ca-4c77-baff-91138c486bd8", + "metadata": {}, + "outputs": [], + "source": [ + "predY_train=c.predict(X_train)\n", + "predY_test=c.predict(X_test)\n", + "ac_train=accuracy_score(Y_train, predY_train)\n", + "ac_test=accuracy_score(Y_test, predY_test)\n", + "print(f\"{ac_train=},{ac_test=}\")" + ] + }, + { + "cell_type": "markdown", + "id": "93202819-ce65-44cf-a9b1-a8fc52c9ac2e", + "metadata": {}, + "source": [ + "The accuracy of the classifier is quite bad. 
We try another classifier, based on k-nearest neighbors: " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6db8995e-442d-448a-8bec-bb4250066ef2", + "metadata": {}, + "outputs": [], + "source": [ + "data_features=old_data_features.copy()\n", + "Y=LabelEncoder().fit_transform(data_features[\"country\"])\n", + "X=data_features.drop(columns=[\"cheese\",\"vegan\",\"country\",\"region\",\"vegetarian\",\"location\",\"latitude\",\"longitude\"])\n", + "X_train, X_test, Y_train, Y_test = train_test_split(\n", + " X, Y, random_state=0,test_size=.1)\n", + "c=KNeighborsClassifier()\n", + "c=c.fit(X_train,Y_train)\n", + "\n", + "predY_train=c.predict(X_train)\n", + "predY_test=c.predict(X_test)\n", + "ac_train=accuracy_score(Y_train, predY_train)\n", + "ac_test=accuracy_score(Y_test, predY_test)\n", + "print(f\"{ac_train=},{ac_test=}\")" + ] + }, + { + "cell_type": "markdown", + "id": "c10f4683-2b79-4944-b92d-ae998c6ba072", + "metadata": {}, + "source": [ + "The accuracy on the training set is better, but the accuracy on the test set is the same as before.\n", + "Thus, determining where a cheese originates from is not easy, because several countries can produce very similar cheeses. 
" + ] + }, { "cell_type": "markdown", "id": "038cd38e-3890-4f73-91a7-c30294b3bc5b", @@ -587,16 +634,6 @@ "display(HTML(assoc_rules.to_html()))" ] }, - { - "cell_type": "code", - "execution_count": null, - "id": "a3a2a838-bc56-4de8-ac5d-f1c3327f5447", - "metadata": {}, - "outputs": [], - "source": [ - "assoc_rules[assoc_rules[\"antecedents\"].astype(str).str.contains(\"rich\")]" - ] - }, { "cell_type": "markdown", "id": "84e2f426-8077-46c7-bc7e-357e631972d2", @@ -606,20 +643,23 @@ "\n", "We applied the apriori algorithm for frequent itemsets and searched for association rules.\n", "\n", - "If we observe the association rules with the highest degree of confidence, we can interpolate the following statements (then verified to be true):\n", - "- cheddar is primarily a cow cheese\n", - "- " + "If we observe the association rules with the highest degree of confidence, we can deduce that, for instance, cheddar is primarily a cow cheese. " ] }, { - "cell_type": "code", - "execution_count": null, - "id": "104b476d-5531-40e7-8bf6-987f00a8f5c1", + "cell_type": "markdown", + "id": "f8298f55-8676-4f2a-bace-62b9f3a89cd7", "metadata": {}, - "outputs": [], "source": [ - "data_f=text_to_boolean(data)\n", - "data_f[(data_f[\"bloomy\"] == True)]" + "## Conclusion\n" + ] + }, + { + "cell_type": "markdown", + "id": "1369354c-ecbd-4a42-89d7-9fc77328db57", + "metadata": {}, + "source": [ + "We did not achieve to get clear results. Maybe, one conclusion of our study can be that very similar cheeses are produced all over the world. Thus, we cannot link the origin of a cheese with its characteristics. " ] } ],