Example of output for 101000,ru (Moscow, September 25):
Weather characteristics | Weather data |
---|---|
Cloudiness | scattered clouds |
Temperature, °C | 20 |
Pressure, hPa | 1020 |
Humidity, % | 47 |
Minimum temperature, °C | 19 |
Maximum temperature, °C | 20 |
Visibility, m | 10000 |
Wind speed, m/s | 4.77 |
Wind degree | 184 |
Geocoords | [37.6156, 55.7522] |
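The table is built from the JSON returned by OpenWeatherMap's current-weather endpoint. A minimal sketch of the formatting step, assuming a response shaped like the `/data/2.5/weather` payload (the sample values below are illustrative, not a live API call):

```python
# Stand-in for the parsed OpenWeatherMap /data/2.5/weather response.
sample = {
    "weather": [{"description": "scattered clouds"}],
    "main": {"temp": 20, "pressure": 1020, "humidity": 47,
             "temp_min": 19, "temp_max": 20},
    "visibility": 10000,
    "wind": {"speed": 4.77, "deg": 184},
    "coord": {"lon": 37.6156, "lat": 55.7522},
}

def weather_table(w):
    """Build (characteristic, value) rows from a weather response dict."""
    return [
        ("Cloudiness", w["weather"][0]["description"]),
        ("Temperature, °C", w["main"]["temp"]),
        ("Pressure, hPa", w["main"]["pressure"]),
        ("Humidity, %", w["main"]["humidity"]),
        ("Minimum temperature, °C", w["main"]["temp_min"]),
        ("Maximum temperature, °C", w["main"]["temp_max"]),
        ("Visibility, m", w["visibility"]),
        ("Wind speed, m/s", w["wind"]["speed"]),
        ("Wind degree", w["wind"]["deg"]),
        ("Geocoords", [w["coord"]["lon"], w["coord"]["lat"]]),
    ]

for name, value in weather_table(sample):
    print(f"{name} | {value}")
```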
Link to GitHub
Back to Top

The aim of this project is to predict the type of wine, white or red, based on its chemical and physical properties. Two wine datasets were used, containing chemical measurements for the white and the red wines. The two analyzed datasets are related to red and white variants of the Portuguese "Vinho Verde" wine. Vinho Verde (pronounced "veeng-yo vaird") is a Portuguese wine that comes from the Vinho Verde region. Each dataset includes 11 wine features together with a column containing the quality score of each wine (scores range between 0 and 10); the individual features are listed in the ratio table below.
A logistic regression model was used to predict the type of wine (red or white) from its chemical and physical properties. Model performance is summarized in the table below:
precision | recall | f1-score | support | |
---|---|---|---|---|
white wine | 0.99 | 0.99 | 0.99 | 1225 |
red wine | 0.98 | 0.98 | 0.98 | 400 |
accuracy: 0.99
Conclusion: The logistic regression model shows good performance, with accuracy = 0.99, recall (white wine) = 0.99, and recall (red wine) = 0.98.
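A minimal sketch of this classification step, assuming the two CSVs have been merged with a binary wine-type label; synthetic features stand in for the real chemical measurements, so the scores will not match the table exactly:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

rng = np.random.default_rng(0)
# Stand-in "chemical property" features: red wines drawn from a shifted
# distribution so the two classes are separable.
X_white = rng.normal(0.0, 1.0, size=(1225, 5))
X_red = rng.normal(1.5, 1.0, size=(400, 5))
X = np.vstack([X_white, X_red])
y = np.array([0] * 1225 + [1] * 400)  # 0 = white, 1 = red

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te),
                            target_names=["white wine", "red wine"]))
```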
The table below shows the ratio between the mean feature values of red wines and the mean feature values of white wines. The figure below shows a grouped barplot of these values for each feature of white and red wine.
wine features | ratio of mean values of red and white wines |
---|---|
fixed_acidity | 1.213697 |
volatile_acidity | 1.896990 |
citric_acid | 0.810839 |
residual_sugar | 0.397221 |
chlorides | 1.910903 |
free_sulfur_dioxide | 0.449612 |
total_sulfur_dioxide | 0.335845 |
pH | 1.038531 |
sulphates | 1.343581 |
alcohol | 0.991318 |
quality | 0.958848 |
Conclusion: The most important features for discriminating between white wine and red wine are those whose red/white mean ratio deviates most from 1: volatile_acidity, chlorides, total_sulfur_dioxide, residual_sugar, and free_sulfur_dioxide.
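The ratio column can be computed with a single pandas groupby; a sketch on a toy frame (the values below are made up, only the column names follow the table):

```python
import pandas as pd

# Toy stand-in for the combined red + white data frame.
df = pd.DataFrame({
    "wine_type": ["red", "red", "white", "white"],
    "volatile_acidity": [0.60, 0.70, 0.30, 0.38],
    "chlorides": [0.090, 0.080, 0.045, 0.044],
})

means = df.groupby("wine_type").mean(numeric_only=True)
ratio = means.loc["red"] / means.loc["white"]  # > 1 means higher in reds
print(ratio.sort_values(ascending=False))
```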
Back to Top
The goal of this project is to classify (cluster) tumor cells based on gene expression profiling.
Clustering performance for k-means and DBSCAN is summarized in the table below:
Performance metrics | k-means | DBSCAN |
---|---|---|
silhouette score | 0.512 | 0.722 |
adjusted rand index | 0.379 | 0.523 |
Conclusion: K-means and DBSCAN were used to separate the analyzed cancer cell lines into clusters based on their gene expression signatures. As the table shows, DBSCAN achieved the better silhouette score (0.722 vs. 0.512) and adjusted Rand index (0.523 vs. 0.379). Thus, cancer cell lines can be clustered based on their gene expression signatures.
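A sketch of the evaluation step, silhouette score and adjusted Rand index for both algorithms; synthetic blobs stand in for the gene-expression matrix, and the eps/min_samples values are illustrative:

```python
from sklearn.cluster import KMeans, DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, adjusted_rand_score

# Synthetic stand-in for the gene-expression matrix, with known labels.
X, y_true = make_blobs(n_samples=300, centers=4, cluster_std=0.6,
                       random_state=0)

km_labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
db_labels = DBSCAN(eps=0.9, min_samples=5).fit_predict(X)

for name, labels in [("k-means", km_labels), ("DBSCAN", db_labels)]:
    print(name,
          round(silhouette_score(X, labels), 3),
          round(adjusted_rand_score(y_true, labels), 3))
```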
Link to GitHub
Back to Top
The following nuclei features have been used for prediction:
1. Radius
2. Texture
3. Perimeter
4. Area
5. Smoothness (local variation in radius)
6. Compactness
7. Concavity
8. Concave points
9. Symmetry
10. Fractal Dimension
The following ML algorithms were used in the current project: Logistic Regression, KNN, Naive Bayes, Decision Tree, Random Forest, and SVM. Results are summarized in the table below:
Model performance | Logistic Regression | KNN | Naive Bayes | Decision Tree | Random Forest | SVM |
---|---|---|---|---|---|---|
recall (benign) | 1.00 | 0.99 | 0.89 | 0.89 | 0.91 | 0.98 |
recall (malignant) | 0.89 | 0.91 | 0.89 | 0.96 | 0.94 | 0.89 |
accuracy | 0.96 | 0.96 | 0.89 | 0.92 | 0.92 | 0.94 |
Conclusion: The logistic regression and KNN models show the highest accuracy (0.96), while the decision tree model shows the highest recall for malignant tumors (0.96). Since the goal is to detect malignant tumors, the decision tree model shows the best performance for this task.
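A sketch of the model comparison on scikit-learn's built-in breast-cancer data (derived from the same nuclei measurements); default hyperparameters are used, so the scores will differ somewhat from the table, and only three of the six models are shown:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import recall_score

X, y = load_breast_cancer(return_X_y=True)  # 0 = malignant, 1 = benign
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=10000),
    "KNN": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}
scores = {}
for name, model in models.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    # Recall on the malignant class (label 0) is the clinically relevant one.
    scores[name] = recall_score(y_te, pred, pos_label=0)
    print(name, round(scores[name], 2))
```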
Link to GitHub
Back to Top
This data set contains 4000 records and 16 variables describing each participant of the study. The purpose of this case study is to investigate whether a participant is at risk of coronary heart disease within 10 years given these variables. The data set contains the following variables:
Demographic:
A logistic regression model was used in the current project to predict whether heart disease occurs within 10 years based on the features listed above. Initially, the model performed poorly at predicting heart disease, but after hyperparameter tuning the performance improved greatly, as shown in the table below:
Model performance | Default threshold 0.5 | Best threshold 0.14 |
---|---|---|
recall (no heart disease) | 0.99 | 0.64 |
recall (heart disease) | 0.09 | 0.73 |
Conclusion: Using the best threshold obtained from the ROC curve greatly improved the model's ability to predict heart disease, with recall = 0.73.
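A hedged sketch of the threshold-tuning step: choose the cutoff maximizing TPR - FPR (Youden's J) on the ROC curve, then classify with it instead of the default 0.5. Synthetic imbalanced data stands in for the heart-disease frame, so the chosen threshold will not be 0.14:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, recall_score
from sklearn.model_selection import train_test_split

# Imbalanced stand-in data: ~15% positive ("heart disease") cases.
X, y = make_classification(n_samples=4000, weights=[0.85], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]

fpr, tpr, thresholds = roc_curve(y_te, proba)
best = thresholds[np.argmax(tpr - fpr)]  # Youden's J statistic

rec_default = recall_score(y_te, proba >= 0.5)
rec_best = recall_score(y_te, proba >= best)
print(f"best threshold={best:.2f}, "
      f"recall@0.5={rec_default:.2f}, recall@best={rec_best:.2f}")
```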
Link to GitHub
Back to Top
The goal of this project is to build several ML models, select the best one, and identify the most important features for predicting fraud.
Decision Tree, Random Forest, and Logistic Regression models were built.
In the current project, a data set from the UC Irvine Machine Learning Repository is analyzed. The data set includes transactions made by credit cards over two days in September 2013 by European cardholders. There are 284,807 transactions in total, of which 492 are fraudulent. The dataset is highly imbalanced: fraudulent transactions account for only 0.172% of all transactions.
The data set contains 31 variables in total; 28 of them are the result of a PCA transformation. Due to confidentiality issues, the original features are not provided. Two features, 'Time' and 'Amount', are not PCA-transformed. 'Time' holds the seconds elapsed between each transaction and the first transaction in the dataset; it is irrelevant to the current study and was dropped from the data frame. 'Amount' is the transaction amount. The variable 'Class' is the target; it takes the value 1 for fraud and 0 otherwise. Model performance characteristics are shown in the table below:
Model performance | Logistic Regression | Decision Tree | Random Forest |
---|---|---|---|
recall (no fraud) | 0.96 | 1 | 1 |
recall (fraud) | 0.91 | 0.83 | 0.78 |
accuracy | 0.96 | 1 | 1 |
Conclusion: The logistic regression model with threshold 0.001644 showed the best performance for detecting credit card fraud. The six most important features for detecting fraud are V4, V10, V14, V20, V9, and V17.
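A feature ranking of this kind can come from a fitted random forest's `feature_importances_`; a sketch on synthetic data, with V1..V10 as stand-in column names (the real V-features are confidential PCA components):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Imbalanced synthetic stand-in for the transaction data.
X, y = make_classification(n_samples=2000, n_features=10, n_informative=4,
                           weights=[0.95], random_state=0)
cols = [f"V{i}" for i in range(1, 11)]

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
importances = pd.Series(forest.feature_importances_, index=cols)
print(importances.sort_values(ascending=False).head(6))
```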
Link to GitHub
Back to Top
For this project I chose three historical data sets for US ZIP code 43212 (Columbus, Ohio). The first data set was downloaded from https://www.worldweatheronline.com/ in CSV format. It contains hourly historical weather data from 07/01/2008 to 10/25/2020; I kept only the noon reading for each day. Next, historical weather data from 09/30/2011 to 09/29/2014 were extracted.
The dataset was split into three parts. The second dataset was downloaded from https://openweathermap.org, again in CSV format. I converted the CSV file to HTML; the HTML file contains historical weather data from 09/29/2014 to 09/29/2017.
This dataset was also split into three parts. The third data set was downloaded from https://www.visualcrossing.com.
The whole idea of this project is to compare temperatures across 9 consecutive periods and evaluate temperature trends.
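The period comparison can be sketched as splitting a daily noon-temperature series into consecutive equal chunks and comparing their means; the toy series below is generated, not the downloaded data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
dates = pd.date_range("2011-09-30", "2014-09-29", freq="D")
# Toy noon temperatures: a seasonal cycle plus noise.
temps = pd.Series(15 + 10 * np.sin(2 * np.pi * dates.dayofyear / 365)
                  + rng.normal(0, 2, len(dates)), index=dates)

# Split the span into three consecutive equal periods and compare means.
n = len(temps) // 3
chunks = [temps.iloc[i * n:(i + 1) * n] for i in range(3)]
for part in chunks:
    print(part.index[0].date(), "to", part.index[-1].date(),
          round(part.mean(), 2))
```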
Link to GitHub