final version (?)

François Colin de Verdière 2025-04-01 11:05:13 +02:00
parent 0795cf233b
commit dcf18fa71c


@ -36,7 +36,9 @@
"from sklearn import tree\n", "from sklearn import tree\n",
"from sklearn.model_selection import train_test_split\n", "from sklearn.model_selection import train_test_split\n",
"from sklearn.preprocessing import LabelEncoder\n", "from sklearn.preprocessing import LabelEncoder\n",
"from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet" "from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet\n",
"from sklearn.metrics import accuracy_score\n",
"from sklearn.neighbors import KNeighborsClassifier"
] ]
}, },
{ {
@ -44,7 +46,7 @@
"id": "ceb71784-b0bf-4015-b8e6-78007c368e49", "id": "ceb71784-b0bf-4015-b8e6-78007c368e49",
"metadata": {}, "metadata": {},
"source": [ "source": [
"For this project, we chose to study cheeses. We retrieved a [dataset from Kaggle](https://www.kaggle.com/datasets/joebeachcapital/cheese) that gives several characteristics for more than $1000$ cheeses. We have information about the origin, the milk, types, texture, rind, flavor, etc. of these cheeses. " "For this project, we chose to study cheeses. We retrieved [the following dataset from Kaggle](https://www.kaggle.com/datasets/joebeachcapital/cheese) that gives several characteristics for more than $1000$ cheeses. We have information about the origin, the milk, types, texture, rind, flavor, etc. of these cheeses. "
] ]
}, },
{ {
@ -55,6 +57,7 @@
"outputs": [], "outputs": [],
"source": [ "source": [
"data = pd.read_csv(\"cheeses.csv\")\n", "data = pd.read_csv(\"cheeses.csv\")\n",
"\n",
"data" "data"
] ]
}, },
@ -86,7 +89,7 @@
"metadata": {}, "metadata": {},
"outputs": [], "outputs": [],
"source": [ "source": [
"data.describe().T.plot(kind='bar')" "data.describe().T.plot(kind='bar');"
] ]
}, },
{ {
@ -117,7 +120,7 @@
"id": "d42869b5-a4ea-4cd6-bd0e-1532af90f2da", "id": "d42869b5-a4ea-4cd6-bd0e-1532af90f2da",
"metadata": {}, "metadata": {},
"source": [ "source": [
"### Converting the locations to GPS coordinates\n", "### I.A Converting the locations to GPS coordinates\n",
"\n" "\n"
] ]
}, },
@ -293,7 +296,7 @@
"id": "92f7516f-e401-4e27-be68-367558671913", "id": "92f7516f-e401-4e27-be68-367558671913",
"metadata": {}, "metadata": {},
"source": [ "source": [
"### Converting the text data to boolean values\n", "### I.B Converting the text data to boolean values, and the colors to RGB\n",
"\n", "\n",
"We want to transform the many characteristics of the cheeses to boolean values, to be able to use them as numeric data. " "We want to transform the many characteristics of the cheeses to boolean values, to be able to use them as numeric data. "
] ]
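The notebook's own conversion helper is not visible in this diff, so the following is only a minimal sketch of the idea, assuming each characteristic is stored as a comma-separated string (for example `"creamy, soft"`). The function name `tags_to_booleans` and the sample values are hypothetical.

```python
def tags_to_booleans(rows, sep=","):
    """Turn comma-separated tag strings into one boolean column per tag.

    Hypothetical helper for illustration; not the notebook's own function.
    """
    # Split each string into a set of stripped, non-empty tags.
    # A missing value (None) becomes an empty set.
    tag_sets = [
        {t.strip() for t in (row or "").split(sep) if t.strip()}
        for row in rows
    ]
    # The columns are all tags seen anywhere, sorted for a stable order.
    columns = sorted(set().union(*tag_sets))
    # One dict per row: True if the row carries the tag, False otherwise.
    return [{col: col in tags for col in columns} for tags in tag_sets]


# Made-up sample values, not taken from the dataset.
textures = ["creamy, soft", "firm", None, "soft, crumbly"]
for row in tags_to_booleans(textures):
    print(row)
```

With pandas, `Series.str.get_dummies(sep=", ")` gives a similar result as 0/1 columns, provided the separator is consistent across rows.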
@ -377,14 +380,6 @@
" return list(c[0] for c in data_colors), list(c[1] for c in data_colors), list(c[2] for c in data_colors)" " return list(c[0] for c in data_colors), list(c[1] for c in data_colors), list(c[2] for c in data_colors)"
] ]
}, },
{
"cell_type": "code",
"execution_count": null,
"id": "471728e0-5543-4afd-bf54-d21bd49dda75",
"metadata": {},
"outputs": [],
"source": []
},
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": null, "execution_count": null,
@ -410,44 +405,7 @@
"id": "979b9eef-9ca2-4299-a4e0-e8d3813f45c6", "id": "979b9eef-9ca2-4299-a4e0-e8d3813f45c6",
"metadata": {}, "metadata": {},
"source": [ "source": [
"In this part, we achieved to do two things for the classification: create a decision tree on the database and, given a cheese and its characteristics, find where it originates from. \n", "In this part, we try to achieve the following task: given a cheese, can we find where it originates from ?"
"\n"
]
},
{
"cell_type": "markdown",
"id": "da7e65cd-5324-496b-affd-246ae4cf9813",
"metadata": {},
"source": [
"### II.A Decision tree"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a8d0848f-b844-4a08-976d-4d1370070f73",
"metadata": {},
"outputs": [],
"source": [
"Y=LabelEncoder().fit_transform(data_features[\"country\"])\n",
"X=data_features.drop(columns=[\"cheese\",\"country\",\"region\",\"vegetarian\",\"location\",\"latitude\",\"longitude\"])\n",
"data_train, data_test, target_train, target_test = train_test_split(\n",
" X, Y)\n",
"c=tree.DecisionTreeClassifier(max_depth=4,random_state=0)\n",
"c=c.fit(data_train,target_train)\n",
"plt.figure(figsize=(150,100))\n",
"ax=plt.subplot()\n",
"\n",
"tree.plot_tree(c,ax=ax,filled=True,feature_names=X.columns,);"
]
},
{
"cell_type": "markdown",
"id": "fca7080e-cb7b-4030-bafd-9036ecdb15ab",
"metadata": {},
"source": [
"We built a decision tree for our cheese database. \n",
"We noticed that the most relevant features, those used by the decision tree, focus on the texture of the cheese and the taste on the cheeses (rindless, bloomy, soft, tangy), rather than on the animal milk used. \n"
] ]
}, },
{ {
@ -455,9 +413,9 @@
"id": "30bf1cd5-9b95-4300-a172-f36d870c49f6", "id": "30bf1cd5-9b95-4300-a172-f36d870c49f6",
"metadata": {}, "metadata": {},
"source": [ "source": [
"### Linear regression: find location depending on the cheese characteristics\n", "### II.A Linear regression: find location depending on the cheese characteristics\n",
"\n", "\n",
"We try to do a linear regression over the data to see whether, given a cheese, we can guess where it originates from. We are going to see that it does not work very well, each regression model has a $R^2$ coefficient of less than $0.3$, which is very bad. \n" "We try to do a linear regression over the data to see whether, given a cheese, we can guess where it originates from (GPS coordinates). We are going to see that it does not work very well, each regression model has a $R^2$ coefficient of less than $0.3$, which is very bad. \n"
] ]
}, },
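To make the $R^2$ threshold concrete, here is a small sketch of how the coefficient is defined for a single target, $R^2 = 1 - SS_{res}/SS_{tot}$. The function name `r2_manual` and the latitude values are made up for illustration and are not taken from the notebook; for two targets (latitude and longitude) the same computation would be applied to each column.

```python
def r2_manual(y_true, y_pred):
    """Coefficient of determination for one target: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    # Residual sum of squares: error of the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean.
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up latitudes, for illustration only.
latitudes = [48.0, 52.0, 40.0, 44.0]
mean_latitude = sum(latitudes) / len(latitudes)

# A model that always predicts the mean latitude scores 0.
print(r2_manual(latitudes, [mean_latitude] * len(latitudes)))
# A perfect model scores 1.
print(r2_manual(latitudes, latitudes))
```

Read this way, a score below $0.3$ means the model removes less than 30% of the squared error that a constant "always predict the mean location" guess would make.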
{ {
@ -467,6 +425,7 @@
"metadata": {}, "metadata": {},
"outputs": [], "outputs": [],
"source": [ "source": [
"old_data_features=data_features.copy()\n",
"for col in [\"cheese\",\"country\",\"region\",\"location\",\"vegetarian\",\"vegan\"]:\n", "for col in [\"cheese\",\"country\",\"region\",\"location\",\"vegetarian\",\"vegan\"]:\n",
" try: \n", " try: \n",
" del data_features[col]\n", " del data_features[col]\n",
@ -505,11 +464,9 @@
"id": "731e3935-c913-4b1c-b7ca-94392d64ccca", "id": "731e3935-c913-4b1c-b7ca-94392d64ccca",
"metadata": {}, "metadata": {},
"source": [ "source": [
"Not good, even quite bad. \n", "Those result are not good, even quite bad. We cannot find the region a cheese originates from given its characteristic. \n",
"We cannot find the region a cheese originates from given its characteristic. \n",
"\n", "\n",
"\n", "In short, it seems that we cannot find the precise location a cheese originates from given its characteristic. "
"In short, it seems that we cannot find the region a cheese originates from given its characteristic. "
] ]
}, },
{ {
@ -532,6 +489,96 @@
"yprime=pd.DataFrame(model.predict(X),columns=[\"latitude\",\"longitude\"])" "yprime=pd.DataFrame(model.predict(X),columns=[\"latitude\",\"longitude\"])"
] ]
}, },
{
"cell_type": "markdown",
"id": "da7e65cd-5324-496b-affd-246ae4cf9813",
"metadata": {},
"source": [
"### II.A Decision tree"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a8d0848f-b844-4a08-976d-4d1370070f73",
"metadata": {},
"outputs": [],
"source": [
"data_features=old_data_features.copy()\n",
"Y=LabelEncoder().fit_transform(data_features[\"country\"])\n",
"X=data_features.drop(columns=[\"cheese\",\"country\",\"region\",\"vegetarian\",\"location\",\"latitude\",\"longitude\"])\n",
"X_train, X_test, Y_train, Y_test = train_test_split(\n",
" X, Y, random_state=0,test_size=.1)\n",
"c=tree.DecisionTreeClassifier(max_depth=4,random_state=0)\n",
"c=c.fit(X_train,Y_train)\n",
"plt.figure(figsize=(150,100))\n",
"ax=plt.subplot()\n",
"\n",
"tree.plot_tree(c,ax=ax,filled=True,feature_names=X.columns,);"
]
},
{
"cell_type": "markdown",
"id": "fca7080e-cb7b-4030-bafd-9036ecdb15ab",
"metadata": {},
"source": [
"We built a decision tree for our cheese database. \n",
"We noticed that the most relevant features, those used by the decision tree, focus on the texture of the cheese and the taste on the cheeses (rindless, bloomy, soft, tangy), rather than on the animal milk used. \n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4efc8ceb-e3ca-4c77-baff-91138c486bd8",
"metadata": {},
"outputs": [],
"source": [
"predY_train=c.predict(X_train)\n",
"predY_test=c.predict(X_test)\n",
"ac_train=accuracy_score(Y_train, predY_train)\n",
"ac_test=accuracy_score(Y_test, predY_test)\n",
"print(f\"{ac_train=},{ac_test=}\")"
]
},
{
"cell_type": "markdown",
"id": "93202819-ce65-44cf-a9b1-a8fc52c9ac2e",
"metadata": {},
"source": [
"The accuracy of the classifier is quite bad. We try another classifier, based on K-nearest-neighbors: "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6db8995e-442d-448a-8bec-bb4250066ef2",
"metadata": {},
"outputs": [],
"source": [
"data_features=old_data_features.copy()\n",
"Y=LabelEncoder().fit_transform(data_features[\"country\"])\n",
"X=data_features.drop(columns=[\"cheese\",\"vegan\",\"country\",\"region\",\"vegetarian\",\"location\",\"latitude\",\"longitude\"])\n",
"X_train, X_test, Y_train, Y_test = train_test_split(\n",
" X, Y, random_state=0,test_size=.1)\n",
"c=KNeighborsClassifier()\n",
"c=c.fit(X_train,Y_train)\n",
"\n",
"predY_train=c.predict(X_train)\n",
"predY_test=c.predict(X_test)\n",
"ac_train=accuracy_score(Y_train, predY_train)\n",
"ac_test=accuracy_score(Y_test, predY_test)\n",
"print(f\"{ac_train=},{ac_test=}\")"
]
},
{
"cell_type": "markdown",
"id": "c10f4683-2b79-4944-b92d-ae998c6ba072",
"metadata": {},
"source": [
"The accuracy on the train dataset is better, but the accuracy on the test dataset is the same as before.\n",
"Thus, determining where a cheese originates from is not easy, because multiple countries can produce very similar cheeses. "
]
},
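One way to judge whether a test accuracy is low is to compare it with a trivial baseline that always predicts the most frequent country seen during training. The sketch below is an assumption-laden illustration, not part of the notebook: the function name `majority_baseline_accuracy` and the country labels are made up.

```python
from collections import Counter


def majority_baseline_accuracy(y_train, y_test):
    """Accuracy of always predicting the most frequent training label."""
    most_common_label, _ = Counter(y_train).most_common(1)[0]
    hits = sum(1 for y in y_test if y == most_common_label)
    return hits / len(y_test)


# Made-up labels, for illustration only.
y_train = ["US"] * 5 + ["FR"] * 3 + ["IT"] * 2
y_test = ["US", "FR", "US", "IT"]

print(majority_baseline_accuracy(y_train, y_test))
```

If the decision tree and the k-nearest-neighbors classifier score close to this baseline on the test set, they have learned little beyond the class frequencies. A large gap between training and test accuracy, as observed for k-nearest neighbors, points towards overfitting.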
{ {
"cell_type": "markdown", "cell_type": "markdown",
"id": "038cd38e-3890-4f73-91a7-c30294b3bc5b", "id": "038cd38e-3890-4f73-91a7-c30294b3bc5b",
@ -587,16 +634,6 @@
"display(HTML(assoc_rules.to_html()))" "display(HTML(assoc_rules.to_html()))"
] ]
}, },
{
"cell_type": "code",
"execution_count": null,
"id": "a3a2a838-bc56-4de8-ac5d-f1c3327f5447",
"metadata": {},
"outputs": [],
"source": [
"assoc_rules[assoc_rules[\"antecedents\"].astype(str).str.contains(\"rich\")]"
]
},
{ {
"cell_type": "markdown", "cell_type": "markdown",
"id": "84e2f426-8077-46c7-bc7e-357e631972d2", "id": "84e2f426-8077-46c7-bc7e-357e631972d2",
@ -606,20 +643,23 @@
"\n", "\n",
"We applied the apriori algorithm for frequent itemsets and searched for association rules.\n", "We applied the apriori algorithm for frequent itemsets and searched for association rules.\n",
"\n", "\n",
"If we observe the association rules with the highest degree of confidence, we can interpolate the following statements (then verified to be true):\n", "If we observe the association rules with the highest degree of confidence, we can deduce that, for instance, cheddar is primarily a cow cheese. "
"- cheddar is primarily a cow cheese\n",
"- "
] ]
}, },
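As a reminder of what the confidence of a rule measures, the sketch below computes support and confidence by hand for a rule such as {cheddar} → {cow}. The transactions are made up and the function names are hypothetical; the notebook itself relies on a library implementation of apriori.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """support(antecedent and consequent together) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Made-up transactions: each set lists the attributes of one cheese.
cheeses = [
    {"cheddar", "cow"},
    {"cheddar", "cow"},
    {"cheddar", "goat"},
    {"brie", "cow"},
]

print(support(cheeses, {"cheddar"}))
print(confidence(cheeses, {"cheddar"}, {"cow"}))
```

A confidence close to 1 for {cheddar} → {cow} is what supports a statement like "cheddar is primarily made from cow's milk". A high confidence alone can still be misleading when the consequent is very frequent in the whole dataset.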
{ {
"cell_type": "code", "cell_type": "markdown",
"execution_count": null, "id": "f8298f55-8676-4f2a-bace-62b9f3a89cd7",
"id": "104b476d-5531-40e7-8bf6-987f00a8f5c1",
"metadata": {}, "metadata": {},
"outputs": [],
"source": [ "source": [
"data_f=text_to_boolean(data)\n", "## Conclusion\n"
"data_f[(data_f[\"bloomy\"] == True)]" ]
},
{
"cell_type": "markdown",
"id": "1369354c-ecbd-4a42-89d7-9fc77328db57",
"metadata": {},
"source": [
"We did not achieve to get clear results. Maybe, one conclusion of our study can be that very similar cheeses are produced all over the world. Thus, we cannot link the origin of a cheese with its characteristics. "
] ]
} }
], ],