Outliers Detection Using IQR, Z-Score, LOF, and DBSCAN


This article was published as a part of the Data Science Blogathon.

Introduction

How should outliers be handled in your data? Which methods work well when the data density is not uniform across the distribution? How do you identify outliers in the first place?

This article will help you answer many such questions, along with their practical applications, whether you are cleaning data before conducting EDA, passing data to a machine learning model, or running a statistical test.

What are Inliers and Outliers?

Outliers are values that seem excessively different from most of the other values in a given dataset. Outliers typically arise from new inventions (true outliers), the development of new patterns/phenomena, experimental errors, rarely occurring incidents, anomalies, incorrectly entered data due to typographical mistakes, failure of data-recording systems/components, etc. However, not all outliers are bad; some reveal new information. Inliers are all the data points that are part of the distribution except the outliers.

Outlier Identification

Collective Outliers: When a group of data points deviates from the distribution, it is called a collective outlier. Interpreting their relevance is entirely subjective to the specific domain. Collective outliers can also indicate the formation of new phenomena or developments.

Contextual Outliers: These are values that become outliers only in a specific context (e.g., the usual temperature in Leh during winter goes near 9°C, which would be the rarest of phenomena in Ahmedabad, Gujarat; punctuation symbols when attempting text analysis; background noise signals when doing speech recognition; etc.).

Fig. 1: Point/Global and Collective Outliers

For ease of understanding, I have considered a real case study on steel scrap sales over three years.

Real Case Example of Outliers

Consider a real-case scenario: the Steel Sheet Scrap Rate (Rs/Kg) sold across India from 2023 to 2023 has been captured to understand its statistics and to predict the price in the future. Before doing that, as part of the data-cleaning process, we want to understand the presence of outliers and their weightage accordingly.

Importing important libraries to load the dataset and conduct further analysis:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as st
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

df = pd.read_excel("scrap_data.xlsx", skiprows=2)
print('shape of data:', df.shape)
df.head()

To understand the trend, I have plotted line plots of the two main independent variables (‘Scrap Rate’ and ‘Scrap Weight’) against their date of sale.

plt.figure(figsize=(15,5))

plt.subplot(1,2,1)
sns.lineplot(x=df['Job Start Date'], y=df['Rate in Rs./Kg.'], color='r')
plt.title("Steel Scrap Rate (Rs/Kg)", fontsize=20)
plt.xlabel('Date')

plt.subplot(1,2,2)
sns.lineplot(x=df['Job Start Date'], y=df['Scrape Sale Qty.'], color='b')
plt.title("Steel Scrap Weight (Kgs)", fontsize=20)
plt.xlabel('Date')

Looking at the trend in the Scrap Rate feature, we can see sudden spikes crossing 120 Rs/kg, which indicate anomalies, as the scrap rate should stay roughly stable and increase or decrease only gradually. In the case of Scrap Weight, however, depending on the size of a construction project, the scrap generated at the end of the project may legitimately be high or low in volume at any time.

Let’s try applying the different methods of detecting and treating outliers:

Inter Quartile Range (IQR)

IQR measures variability by dividing the sorted dataset into four equal parts. First, the entire data is sorted in ascending order and then split into four quarters; the cut points are the quartiles Q1, Q2, and Q3, and the inter-quartile range is IQR = Q3 – Q1. The IQR method is best suited when the data forms a skewed distribution.

The first Quartile (Q1) divides the smallest 25% of the values from the other 75% that are larger.

The third quartile (Q3) divides the smallest 75% of the values from the largest 25%.

Lower Bound Limit = Q1 – 1.5 x IQR

Upper Bound Limit = Q3 + 1.5 x IQR 

So, any values greater than the Upper Bound Limit (Q3 + 1.5*IQR) or less than the Lower Bound Limit (Q1 – 1.5*IQR) in the given dataset can be considered outliers.
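As a quick, minimal sketch of these fences (using hypothetical rate values, not the actual dataset), the calculation looks like this:

import numpy as np

rates = np.array([28, 30, 31, 32, 33, 34, 35, 36, 38, 120])   # hypothetical scrap rates in Rs/kg
q1, q3 = np.percentile(rates, [25, 75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr      # 24.5 for this sample
upper_fence = q3 + 1.5 * iqr      # 42.5 for this sample
print(rates[(rates < lower_fence) | (rates > upper_fence)])    # -> [120]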

Let’s plot boxplots to check for the presence of outliers:

plt.figure(figsize=(15,5))

plt.subplot(1,2,1)
sns.boxplot(df['Scrape Sale Qty.'])
plt.xticks(fontsize=(12))
plt.xlabel('Steel-Scrap Weight (in Kgs)')
plt.legend(title="Steel Scrap Weight", fontsize=10, title_fontsize=15)

plt.subplot(1,2,2)
sns.boxplot(df['Rate in Rs./Kg.'])
plt.xlabel('Steel Scrap Rate Rs/kg')
plt.xticks(fontsize=(12));
plt.legend(title="Steel Scrap Rate", fontsize=10, title_fontsize=15);

To make the calculation faster, I have created a function to derive Inter-Quartile-Range (IQR), Lower Fence, and Upper Fence and added conditions either to drop them or fill them with upper or lower values, respectively. 

def identifying_treating_outliers(df, col, remove_or_fill_with_quartile):
    q1 = df[col].quantile(0.25)
    q3 = df[col].quantile(0.75)
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    print('Lower Fence:', lower_fence)
    print('Upper Fence:', upper_fence)
    print('Total number of outliers:',
          df[(df[col] < lower_fence) | (df[col] > upper_fence)].shape[0])
    if remove_or_fill_with_quartile == "drop":
        df.drop(df.loc[(df[col] < lower_fence) | (df[col] > upper_fence)].index, inplace=True)
    elif remove_or_fill_with_quartile == "fill":
        df[col] = np.where(df[col] < lower_fence, lower_fence, df[col])
        df[col] = np.where(df[col] > upper_fence, upper_fence, df[col])

Applying the Function to the Scrap Rate and Scrap Weight column:

identifying_treating_outliers(df,'Scrape Sale Qty.','drop') identifying_treating_outliers(df,'Rate in Rs./Kg.','drop')

DF shape before Application of Function : (1001, 5)

DF Shape after Application of Function : (925, 5)

Plotting boxplots to check the status of outliers after applying the ‘identifying_treating_outliers’ function:

plt.figure(figsize=(15,5))

plt.subplot(1,2,1)
sns.boxplot(df['Scrape Sale Qty.'])
plt.xticks(fontsize=(12))
plt.xlabel('Steel-Scrap Weight (in Kgs)')
plt.legend(title="Steel Scrap Weight", fontsize=10, title_fontsize=15)

plt.subplot(1,2,2)
sns.boxplot(df['Rate in Rs./Kg.'])
plt.xlabel('Steel Scrap Rate Rs/kg')
plt.xticks(fontsize=(12));
plt.legend(title="Steel Scrap Rate", fontsize=10, title_fontsize=15);

Z-Score Method

The Z-score of a value is the difference between that value and the mean, divided by the standard deviation. Z-scores help identify outliers: a data point is treated as an outlier if its Z-score is either less than -3 or greater than +3. The Z-score can be mathematically expressed as:

Z = (x – μ) / σ, where x = particular value, μ = mean, σ = standard deviation
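As a minimal sketch of this formula in action (with hypothetical rates, not the scrap dataset), the manual calculation and scipy.stats.zscore give identical results:

import numpy as np
import scipy.stats as st

rng = np.random.default_rng(0)
x = np.append(rng.normal(35, 3, 200), 150.0)   # hypothetical rates plus one extreme value
z_manual = (x - x.mean()) / x.std()            # the formula above, applied element-wise
z_scipy = st.zscore(x)                         # same result via scipy
print(np.allclose(z_manual, z_scipy))          # True
print(x[np.abs(z_scipy) > 3])                  # only the extreme value (150.0) is flagged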

The picture below shows the transformation of data from a normal distribution to a standard normal distribution using the Z-score, given here for reference.

In our dataset, we will treat values with a Z-score greater than +3 or less than -3 as outliers. Just a few lines of code give us the Z-scores, and we can see the difference using the distribution plots (before and after).

# Applying Z-score to the Scrap Rate column; the filtered dataframe is dfn
zr = st.zscore(df['Rate in Rs./Kg.'])
dfn = df[(zr > -3) & (zr < 3)]

# Applying Z-score to the Scrap Weight column; the filtered dataframe is dfnf
zw = st.zscore(dfn['Scrape Sale Qty.'])
dfnf = dfn[(zw > -3) & (zw < 3)]

plt.figure(figsize=(12,5))
plt.subplot(1,2,1)
sns.distplot(df['Rate in Rs./Kg.'])
plt.title('Z Score Plot Before Removing Outlier', fontsize=15)
plt.subplot(1,2,2)
sns.distplot(st.zscore(dfn['Rate in Rs./Kg.']))
plt.title('Z Score Plot After Removing Outlier', fontsize=15)

Our data forms a positively skewed distribution (skewness value: 0.874), so it cannot be considered approximately normally distributed from the above plot. Still, a significant improvement can be seen when comparing the plots before and after applying the Z-score filter.

print('before df shape', df.shape)
print('After df shape for observations dropped in Scrap Rate', dfn.shape)
print('After df shape for observations dropped in Scrap Weight', dfnf.shape)

Using the Z-score method on the Scrap Rate and Scrap Weight columns, we have dropped 21 data points (3 from Scrap Rate and 18 from Scrap Weight) whose Z-scores fell outside the ±3 range.

Local Outlier Factor (LOF)

Local Outlier Factor is an unsupervised machine learning technique that detects outliers based on the density of each data point’s nearest neighborhood, and it works well when the spread (density) of the dataset is not uniform. LOF basically considers the K-distance (the distance between points) and the K-neighbors (the set of points that lie within the circle of radius K-distance). See the detailed documentation in the SK-Learn library.

LOF takes two major parameters into consideration: (1) n_neighbors: the number of neighbors, which has a default value of 20; (2) contamination: the proportion of outliers in the given dataset, which can be set to ‘auto’ or to a float value (e.g., 0.005, 0.02).

Importing important libraries and defining model

from sklearn.neighbors import LocalOutlierFactor

d2 = df[['Scrape Sale Qty.', 'Rate in Rs./Kg.']].values   # converting the two numeric features into a numpy array
lof = LocalOutlierFactor(n_neighbors=20, contamination='auto')
good = lof.fit_predict(d2) == 1

plt.figure(figsize=(10,5))
plt.scatter(d2[good, 1], d2[good, 0], s=2, label="Inliers", color="#4CAF50")
plt.scatter(d2[~good, 1], d2[~good, 0], s=8, label="Outliers", color="#F44336")
plt.title('Outlier Detection using Local Outlier Factor', fontsize=20)
plt.legend(fontsize=15, title_fontsize=15)

In our case, I set contamination to ‘auto’ (see the plot above) and found that LOF does not perform that well here, since the spread (density) of the data does not deviate much. I also tried contamination values of 0.005, 0.01, 0.02, 0.05, and 0.09, but the performance was still not good.
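If you want to look beyond the binary inlier/outlier labels, a minimal sketch (assuming the lof model and d2 array from the block above) is to inspect the fitted negative_outlier_factor_ scores, or to sweep a few contamination values and count how many points each setting flags:

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

scores = lof.negative_outlier_factor_        # values much lower than -1 are more outlier-like
print(np.sort(scores)[:5])                   # the five most outlier-like points

for c in [0.005, 0.01, 0.02, 0.05, 0.09]:    # the contamination values tried above
    labels = LocalOutlierFactor(n_neighbors=20, contamination=c).fit_predict(d2)
    print(c, (labels == -1).sum(), 'points flagged')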

Density-Based Spatial Clustering of Applications with Noise (DBSCAN)

When our dataset is large and has multiple numeric features (multivariate), it becomes difficult to handle outliers using IQR, Z-score, or LOF. Here the SK-Learn library’s DBSCAN comes to the rescue, allowing us to handle outliers for multivariate datasets.

DBSCAN considers two main parameters (listed below) to form clusters of nearby data points, and based on whether a point falls in a high- or low-density region, it is labelled as an inlier or an outlier.

(1) Epsilon (the neighborhood radius around a data point, which we can estimate from the k-distance graph)

(2) Min_samples (the minimum number of points required within that radius for a point to be treated as part of a dense region)

However, in our case, we don’t have more than 5 features, and we have selected just the two important numeric features to apply our learnings and visualize the results, since multi-dimensional data is difficult to visualize all at once. With that in mind, we apply DBSCAN to our dataset.

Importing the libraries and fitting the model. To nullify the noise in the dataset, we have normalized the data using the Min-Max Scaler.

from sklearn.cluster import DBSCAN
from sklearn.preprocessing import MinMaxScaler

mms = MinMaxScaler()
df[['Scrape Sale Qty.','Rate in Rs./Kg.']] = mms.fit_transform(df[['Scrape Sale Qty.','Rate in Rs./Kg.']])
df.head()

from sklearn.neighbors import NearestNeighbors
neigh = NearestNeighbors(n_neighbors=2)
nbrs = neigh.fit(df[['Scrape Sale Qty.', 'Rate in Rs./Kg.']])
distances, indices = nbrs.kneighbors(df[['Scrape Sale Qty.', 'Rate in Rs./Kg.']])

# Plotting K-distance Graph
distances = np.sort(distances, axis=0)
distances = distances[:,1]
plt.figure(figsize=(8,5))
plt.plot(distances)
plt.title('K-distance Graph', fontsize=20)
plt.xlabel('Data Points sorted by distance', fontsize=14)
plt.ylabel('Epsilon', fontsize=14)
plt.show()

The above plot shows the elbow of the curve settling near an epsilon value of 0.08, and for min_samples (the number of points we want within the epsilon radius of each data point) we select 10.

model = DBSCAN(eps=0.08, min_samples=10).fit(df[['Scrape Sale Qty.', 'Rate in Rs./Kg.']])
colors = model.labels_

plt.figure(figsize=(10,7))
plt.scatter(df['Rate in Rs./Kg.'], df['Scrape Sale Qty.'], c=colors)
plt.title('Outliers Detection using DBSCAN', fontsize=20)

The DBSCAN technique has efficiently detected the significant outliers using density-based spatial clustering, as can be seen in the resulting plot.
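To put a number on this (a small sketch assuming the fitted model and scaled df from the block above), the points DBSCAN labels as -1 are its noise points, i.e., the potential outliers referred to in the conclusion below:

labels = model.labels_                                   # -1 marks noise points (potential outliers)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print('clusters found:', n_clusters)
print('points flagged as outliers:', (labels == -1).sum())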

Conclusion

IQR is the simplest and most mathematically explained technique. It is good for univariate and bivariate data, as it detects extreme values using the quartiles around the median as its measure of dispersion, but it becomes limited for multivariate datasets with a large number of numeric features. In our case, we applied it by defining a function to detect and treat outliers, and 76 data points were dropped as outliers.

DBSCAN does not require the number of clusters to be defined and is able to detect anomalies where the data spread is arbitrarily distributed and not linearly separable. It has its own limitations when working with data of varying density. In our case, it detected 16 data points as potential outliers.

Happy learning !!

For further details, connect with me:

[email protected]

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

Related


Room Occupancy Detection Using Machine Learning Algorithms

This article was published as a part of the Data Science Blogathon

In this article, we will see how we can detect room occupancy from environmental-variable data using machine learning algorithms. For this purpose, I am using the Occupancy Detection Dataset from the UCI ML Repository. Ground-truth occupancy was obtained from time-stamped pictures, while environmental variables like temperature, humidity, light, and CO2 were recorded every minute. Implementing an ML algorithm instead of a physical PIR sensor makes the solution cost- and maintenance-free. This might be useful in the field of HVAC (Heating, Ventilation, and Air Conditioning).

Data Understanding and EDA

Here we are using R for ML programming. The dataset zip has 3 text files, one for training the model and two for testing it. To read these files in R we use read.csv(), and to explore the data structure, data dimensions, and five-point statistics of the dataset we use the “summarytools” package. The images included here were captured from the R console while executing the code.

data= read.csv("datatrain.txt",header = T, sep = ",", row.names = 1) View(data) library(summarytools) summarytools::view(dfSummary(data))

Data Summary 

Observations

All environmental variables are read correctly as numeric types by R; however, we need to define the variable “date” as a date class type. For treating the ‘Date’ variable, we use the “lubridate” package. We also have time-of-day information available in the date column, which we can extract and use for modeling, as occupancy of spaces like offices depends on the time of day. Another important observation is that we don’t have missing values in the entire dataset. The Occupancy variable needs to be defined as a factor type for further analysis.

library("readr") library("lubridate") data$date1 = as_date(data$date) data$date= as.POSIXct(data$date, format = "%Y-%m-%d %H:%M:%S") data$time = format(data$date, format = "%H:%M:%S") data1= data[,-1] data1$Occupancy = as.factor(data1$Occupancy)

Now our processed data looks like this:

Processed Data Preview

Next, we check two important aspects of data, one is the correlation plot of variables to understand multicollinearity issues in the dataset and the proportion of target variable distribution.

library(corrplot) numeric.list <- sapply(data1, is.numeric) numeric.list sum(numeric.list) numeric.df <- data1[, numeric.list] cor.mat <- cor(numeric.df) corrplot(cor.mat, type = "lower", method = "number")

Correlation Plot

library(plotrix)
pie3D(prop.table((table(data1$Occupancy))), main = "Occupied Vs Unoccupied",
      labels = c("Unoccupied", "Occupied"), col = c("Blue", "Dark Blue"))

Pie Chart for Occupancy

From the correlation plot, we observe that temperature and light are positively correlated, while temperature and humidity are negatively correlated. Humidity and humidity ratio are highly correlated, which is expected, as the humidity ratio is derived from humidity and temperature. Hence, while building the various models, we will keep the Humidity variable and omit Humidity Ratio.

Now we are all set for model building. As our response variable Occupancy is binary, we need classification types of models. Here we implement CART, RF, and ANN.

Model Building- Classification and Regression Trees (CART): Now we define training and test datasets in the required format for model building.

p_train = data1 p_test = read.csv("datatest2.txt",header = T, sep = ",", row.names = 1) p_test$date1 = as_date(p_test$date) p_test$date= as.POSIXct(p_test$date, format = "%Y-%m-%d %H:%M:%S") p_test$time = format(p_test$date, format = "%H:%M:%S") p_test= p_test[,-1]

Note that the R implementation of the CART algorithm is called RPART (Recursive Partitioning And Regression Trees) available in a package of the same name. The algorithm of decision tree models works by repeatedly partitioning/splitting the data into multiple sub-spaces so that the outcomes in each final sub-space are as homogeneous as possible.

The model uses different splitting rules that can be used to effectively predict the type of outcome. These rules are produced by repeatedly splitting the predictor variables, starting with the variable that has the highest association with the response variable. The process continues until some predetermined stopping criteria are met. We define these stopping criteria using control parameters, such as the minimum number of observations in a node before attempting a split, or the requirement that a split must decrease the overall lack of fit by a given factor before being attempted.

library(rpart) library(rpart.plot) library(rattle) #Setting the control parameters r.ctrl = rpart.control(minsplit=100, minbucket = 10, cp = 0, xval = 10) #Building the CART model set.seed(123) m1 <- rpart(formula = Occupancy~ Temperature+Humidity+Light+CO2+date1+time, data = p_train, method = "class", control = r.ctrl) #Displaying the decision tree fancyRpartPlot(m1).

Decision Tree

Now we predict the occupancy variable for the test dataset using predict function.

p_test$predict.class1 <- predict(ptree, p_test[,-6], type="class") p_test$predict.score1 <- predict(ptree, p_test[,-6], type="prob") View(p_test)

Test Data Preview with Actual and Predicted Occupancy

Now evaluate the performance of our model by plotting the ROC curve and building a confusion matrix.

library(ROCR) pred <- prediction(p_test$predict.score1[,2], p_test$Occupancy) perf <- performance(pred, "tpr", "fpr") plot(perf,main = "ROC curve") auc1=as.numeric(performance(pred, "auc")@y.values) library(caret) m1=confusionMatrix(table(p_test$predict.class1,p_test$Occupancy), positive="1",mode="everything")

Here we get an AUC of 82% and a model accuracy of 98.1%.

Next on the list is Random Forest. In the random forest approach, a large number of decision trees are created, every observation is fed into every decision tree, and the most common outcome (majority vote) across the trees is used as the final prediction for that observation. The R package “randomForest” is used to create random forests.

RFmodel = randomForest(Occupancy~Temperature+Humidity+Light+CO2+date1+time, data = p_train1, mtry = 5, nodesize = 10, ntree = 501, importance = TRUE) print(RFmodel) plot(RFmodel)

Error Vs No of trees plot

Here we observe that the error remains constant after about n = 150 trees, so we can tune the model with ntree = 150.

Also, we can have a look at important variables in the model which are contributing to occupancy detection.

importance(RFmodel)

Variable Importance table

We observe from the above output that light is the most important predictor followed by CO2, Temperature, Time, Humidity, and date when we consider accuracy. Now, let’s prune the RF model with new control parameters.

set.seed(123) tRF=tuneRF(x=p_train1[,-c(5,6)],y=as.factor(p_train1$Occupancy), mtryStart = 5, ntreeTry = 150, stepFactor = 1.15, improve = 0.0001, trace = TRUE, plot = TRUE, doBest = TRUE, nodesize=10, importance= TRUE)

Now, with this tuned model, we again check variable importance as follows. Please note that here variable importance is measured by the mean decrease in the Gini index, whereas earlier it was the mean decrease in model accuracy.

varImpPlot(tRF,type=2, main = "Important predictors in the analysis")

Variable Importance Plot

Now next we predict occupancy for the test dataset using predict function and tuned RF model.

p_test1$predict.class= predict(tRF, p_test1, type= "class") p_test1$predict.score= predict(tRF, p_test1, type= "prob")

We check the performance of this model using the ROC curve and confusion matrix parameters. The AUC turns out to be 99.13% with a very stiff curve as below and the accuracy of prediction is 98.12%.

ROC Curve

It seems like this model is doing better than the CART model. Time to check ANN!

Now we build an artificial neural network for the classification. A Neural Network (or Artificial Neural Network) has the ability to learn from examples. ANN is an information processing model inspired by the biological neuron system. It is composed of a large number of highly interconnected processing elements known as neurons, working together to solve problems.

Here I am using the ‘neuralnet’ package in R. When we try to build the ANN for our case, we observe that the model does not accept date-class variables, so we will omit them. Another approach could be to create a separate factor variable from the time-of-day variable with levels like Morning, Afternoon, Evening, and Night, and then create dummy variables for that factor.

Before actually building the ANN, we need to scale our data, as the variables have values in different ranges. ANN being a weight-based algorithm, it may give biased results if the data is not scaled.

p_train2=p_train2[,-c(6,7,8)] p_train_sc=scale(p_train2) p_train_sc = as.data.frame(p_train_sc) p_train_sc$Occupancy=data1$Occupancy p_test3=p_test2[,-c(6,7,8)] p_test_sc=scale(p_test3) p_test_sc = as.data.frame(p_test_sc) p_test_sc$Occupancy=p_test2$Occupancy

After scaling all the variables (except Occupancy) our data will look like this.

Data Preview After Scaling

Now we are all set for model building.

nn1 = neuralnet(formula = Occupancy ~ Temperature + Humidity + Light + CO2 + HumidityRatio,
                data = p_train_sc, hidden = 3, err.fct = "sse", linear.output = FALSE,
                lifesign = "full", lifesign.step = 10, threshold = 0.03, stepmax = 10000)
plot(nn1)

We calculate results for the test dataset using Compute function.

compute.output = compute(nn1, p_test_sc[,-6]) p_test_sc$Predict.score <- compute.output$net.result

Models Performance Comparison: We again evaluate this model using the confusion matrix and ROC curve. I have tabulated results obtained from all three models as follows:

Performance measure | CART on test dataset | RF on test dataset | ANN on test dataset
--- | --- | --- | ---
AUC | 0.8253 | 0.9913057 | 0.996836
Accuracy | 0.981 | 0.9812 | 0.9942
Kappa | 0.9429 | 0.9437 | 0.9825
Sensitivity | 0.9514 | 0.9526 | 0.9951
Specificity | 0.9889 | 0.9889 | 0.9939
Precision | 0.9586 | 0.9587 | 0.9775
Recall | 0.9514 | 0.9526 | 0.9951
F1 | 0.9550 | 0.9556 | 0.9862
Balanced Accuracy | 0.9702 | 0.9708 | 0.9945

From the performance-measure comparison, we observe that ANN outperforms the other models, followed by RF and CART. With machine learning algorithms, we can replace the occupancy sensor’s functionality efficiently and with good accuracy.

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.

Related

Anomaly Detection On Google Stock Data 2014

Introduction

Welcome to the fascinating world of stock market anomaly detection! In this project, we’ll dive into the historical data of Google’s stock from 2014-2023 and use cutting-edge anomaly detection techniques to uncover hidden patterns and gain insights into the stock market. By identifying outliers and other anomalies, we aim to understand stock market trends better and potentially discover new investment opportunities. With the power of Python and the Scikit-learn library at our fingertips, we’re ready to embark on a thrilling data science journey that could change how we view the stock market forever. So, fasten your seatbelts and get ready to discover the unknown!

Learning Objectives:

In this article, we will:

Explore the data and identify potential anomalies.

Create visualizations to understand the data and its anomalies better.

Construct and train a model to detect anomalous data points.

Analyze and interpret our results to draw meaningful conclusions about the stock market.

This article was published as a part of the Data Science Blogathon.

Understanding the Data and Problem Statement

In this project-based blog, we will explore anomaly detection in Google stock data from 2014-2023. The dataset used in this project is obtained from Kaggle, and you can download it here. It consists of monthly stock price data for Google, also known as Alphabet Inc. (GOOGL), from 2014 to 2023, and contains 106 rows and 7 columns. The features include the opening, closing, highest, and lowest prices and the volume of shares traded, as well as the date on which the stock was traded.

Problem statement

This project aims to analyze the Google stock data from 2014-2023 and use anomaly detection techniques to uncover hidden patterns and outliers in the data. We will use the Scikit-learn library in Python to construct and train a model to detect anomalous data points within the dataset. Finally, we will analyze and interpret our results to draw meaningful conclusions about the stock market.

Data Preprocessing

Missing values

Python Code:



Finding data points that have a 0.0% change from the previous month’s value:

data[data['Change %']==0.0]

Two data points, 100 and 105, have a 0.0% change.

Changing the ‘Month Starting’ column to a date datatype:

data['Month Starting'] = pd.to_datetime(data['Month Starting'], errors='coerce').dt.date

After converting to this format, we encountered three unexpected missing values. Let’s address these missing values.

#Replacing the missing values after cross verifying data['Month Starting'][31] = pd.to_datetime('2023-05-01') data['Month Starting'][43] = pd.to_datetime('2023-05-01') data['Month Starting'][55] = pd.to_datetime('2023-05-01')

The data is now clean and ready to be analyzed.

Exploratory Data Analysis

Exploratory Data Analysis (EDA) is an important first step in analyzing a dataset, and it involves examining and summarizing the main characteristics of the data. Data visualization is one of the most powerful and widely used tools in EDA. Data visualization allows us to visually explore and understand the patterns and trends in the data, and it can reveal relationships, outliers, and potential errors in the data.

Change in the stock price over the years:

plt.figure(figsize=(25,5)) plt.plot(data['Month Starting'],data['Open'], label='Open') plt.plot(data['Month Starting'],data['Close'], label='Close') plt.xlabel('Year') plt.ylabel('Close Price') plt.legend() plt.title('Change in the stock price of Google over the years')

The stock price has increased since 2023, with a peak enhancement occurring in 2023.

# Calculating the daily returns data['Returns'] = data['Close'].pct_change() # Calculating the rolling average of the returns data['Rolling Average'] = data['Returns'].rolling(window=30).mean() plt.figure(figsize=(10,5)) ''' Creating a line plot using the 'Month Starting' column as the x-axis and the 'Rolling Average' column as the y-axis''' sns.lineplot(x='Month Starting', y='Rolling Average', data=data)

The plot above shows that the rolling mean decreased in 2023 due to an increase in the stock price.

Correlation is a statistical measure that indicates the degree to which two or more variables are related. It is a useful tool in data analysis, as it can help to identify patterns and relationships between variables and to understand the extent to which changes in one variable are associated with changes in another variable. To find the correlation between variables in the data, we can use the in-built function corr(). This will give us a correlation matrix with values ranging from -1.0 to 1.0. The closer a value is to 1.0, the stronger the positive correlation between the two variables. Conversely, the closer a value is to -1.0, the stronger the negative correlation between the two variables. The heatmap will visually represent the correlation intensity between the variables, with darker colors indicating stronger correlations and lighter colors indicating weaker correlations. This can be a helpful way to identify relationships between variables and guide further analysis quickly.

corr = data.corr() plt.figure(figsize=(10,10)) sns.heatmap(corr, annot=True, cmap='coolwarm')

Scaling the returns using StandardScaler

To ensure that the data is normalized to have zero mean and unit variance, we use the StandardScaler from the Scikit-learn library. We first import the StandardScaler class and then create an instance of the class. We then fit the scaler to the Returns column of our dataset using the fit_transform method. This scales our data to have zero mean and unit variance, which is necessary for some machine learning algorithms to function properly.

from sklearn.preprocessing import StandardScaler scaler = StandardScaler() data['Returns'] = scaler.fit_transform(data['Returns'].values.reshape(-1,1)) data.head()

Handling Unexpected Missing Values

data['Returns'] = data['Returns'].fillna(data['Returns'].mean())
data['Rolling Average'] = data['Rolling Average'].fillna(data['Rolling Average'].mean())

Model Development

Now that the data has been preprocessed and analyzed, we are ready to develop a model for anomaly detection. We will use the Scikit-learn library in Python to construct and train a model to detect anomalous data points within the dataset.

We will use the Isolation Forest algorithm to detect anomalies. Isolation Forest is an unsupervised machine learning algorithm that isolates anomalies by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of the selected feature. This process is repeated until the anomaly is isolated.

We will use the Scikit-learn library to construct and train our Isolation Forest model. The following code snippet shows how to construct and train the model.

from sklearn.ensemble import IsolationForest

model = IsolationForest(contamination=0.05)
model.fit(data[['Returns']])

# Predicting anomalies
data['Anomaly'] = model.predict(data[['Returns']])
data['Anomaly'] = data['Anomaly'].map({1: 0, -1: 1})

# Plotting the results
plt.figure(figsize=(13,5))
plt.plot(data.index, data['Returns'], label='Returns')
plt.scatter(data[data['Anomaly'] == 1].index, data[data['Anomaly'] == 1]['Returns'], color='red')
plt.legend(['Returns', 'Anomaly'])
plt.show()

Conclusion

This project-based blog explored anomaly detection in Google stock data from 2014-2023. We used the Scikit-learn library in Python to construct and train an Isolation Forest model to detect anomalous data points within the dataset.

Our model was able to uncover hidden patterns and outliers in the data, and we were able to draw meaningful conclusions about the stock market. We found that the stock price has increased since 2023 and that the rolling mean decreased in 2023. We also found that the Open price correlates more with the Close price than any other feature.

Overall, this project was a great success and has opened up new possibilities for stock market analysis and anomaly detection.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

Related

Machine Learning Unlocks Insights For Stress Detection

Introduction

Stress is a natural response of the body and mind to a demanding or challenging situation. It is the body’s way of reacting to external pressures or internal thoughts and feelings. Stress can be triggered by a variety of factors, such as work-related pressure, financial difficulties, relationship problems, health issues, or major life events. Stress detection insights, driven by data science and machine learning, aims to forecast stress levels in individuals or populations. By analyzing a variety of data sources, such as physiological measurements, behavioral data, and environmental factors, predictive models can identify patterns and risk factors associated with stress.

This proactive approach enables timely intervention and tailored support. Stress prediction holds potential in health care for early detection and personalized intervention as well as in occupational settings to optimize work environments. It can also inform public health initiatives and policy decisions. With the ability to predict stress, these models provide valuable insights for improving well-being and increasing resilience in individuals and communities.

This article was published as a part of the Data Science Blogathon.

Overview of Stress Detection Using Machine Learning

Stress detection using machine learning involves collecting, cleaning, and preprocessing data. Feature engineering techniques are applied to extract meaningful information or create new features that can capture patterns related to stress. This may involve extracting statistical measures, frequency domain analysis, or time-series analysis to capture physiological or behavioral indicators of stress. Relevant features are extracted or engineered to enhance performance.

Researchers train machine learning models like logistic regression, SVM, decision trees, random forests, or neural networks by utilizing labeled data to classify stress levels. They evaluate the performance of the models using metrics such as accuracy, precision, recall, and F1-score. Integration of the trained model into real-world applications enables real-time stress monitoring. Continuous monitoring, updates, and user feedback are crucial for improving accuracy.

It is crucial to consider ethical issues and privacy concerns when dealing with sensitive personal data related to stress. Proper informed consent, data anonymization, and secure data storage procedures should be followed to protect individuals’ privacy and rights. Ethical considerations, privacy, and data security are important during the entire process. Machine learning-based stress detection enables early intervention, personalized stress management, and improved well-being.

Data Description

The “stress” dataset contains information related to stress levels. Without the specific structure and columns of the dataset, I can only provide a general overview of what such a data description might look like.

The dataset may contain numerical variables that represent quantitative measurements, such as age, blood pressure, heart rate, or stress levels measured on a scale. It may also include categorical variables that represent qualitative characteristics, such as gender, occupation categories, or stress levels classified into different categories (low, medium, high).

# Array
import numpy as np
# Dataframe
import pandas as pd
# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Warnings
import warnings
warnings.filterwarnings('ignore')

# Data Reading
stress_c = pd.read_csv('/human-stress-prediction/Stress.csv')
# Copy
stress = stress_c.copy()
# Data
stress.head()

The function below allows you to quickly assess the data types and find missing or null values. This summary is useful when working with large datasets or performing data cleaning and preprocessing tasks.

# Info stress.info()

Use the code stress.isnull().sum() to check for null values in the “stress” dataset and calculate the sum of null values in each column.

# Checking null values stress.isnull().sum()

To generate statistical information about the “stress” dataset, compile the code below; it returns a summary of descriptive statistics for each numerical column in the dataset.

# Statistical Information
stress.describe()

Exploratory Data Analysis (EDA)

Exploratory Data Analysis (EDA) is a crucial step in understanding and analyzing a dataset. It involves visually exploring and summarizing the main characteristics, patterns, and relationships within the data.

lst = ['subreddit', 'label']
plt.figure(figsize=(15,12))
for i in range(len(lst)):
    plt.subplot(1,2,i+1)
    a = stress[lst[i]].value_counts()
    lbl = a.index
    plt.title(lst[i]+'_Distribution')
    plt.pie(x=a, labels=lbl, autopct="%.1f %%")
plt.show()

The Matplotlib and Seaborn libraries create a count plot for the “stress” dataset. It visualizes the count of stress instances across different subreddits, with the stress labels differentiated by different colors.

plt.figure(figsize=(20,12))
plt.title('Subreddit wise stress count')
plt.xlabel('Subreddit')
sns.countplot(data=stress, x='subreddit', hue='label', palette='gist_heat')
plt.show()

Text Preprocessing

Text preprocessing refers to the process of converting raw text data into a cleaner, more structured format that is suitable for analysis or modeling tasks. It typically involves a series of steps to remove noise, normalize text, and extract relevant features. Here I have added all the libraries related to this text processing.

# Regular Expression
import re
# Handling string
import string
# NLP tool
import spacy
nlp = spacy.load('en_core_web_sm')
from spacy.lang.en.stop_words import STOP_WORDS
# Importing Natural Language Tool Kit for NLP operations
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('omw-1.4')
from nltk.stem import WordNetLemmatizer
from wordcloud import WordCloud, STOPWORDS
from nltk.corpus import stopwords
from collections import Counter

Some common techniques used in text preprocessing include:

Text Cleaning

Removing special characters: Remove punctuation, symbols, or non-alphanumeric characters that do not contribute to the meaning of the text.

Removing numbers: Remove numerical digits if they are not relevant to the analysis.

Lowercasing: Convert all text to lowercase to ensure consistency in text matching and analysis.

Removing stop words: Remove common words that do not carry much information, such as “a”, “the”, “is”, etc.

Tokenization: Splitting the text into individual words or tokens so that each token can be analyzed separately.

Normalization

Lemmatization: Reduce words to their base or dictionary form (lemmas). For example, converting “running” and “ran” to “run”.

Stemming: Reduce words to their base form by removing prefixes or suffixes. For example, converting “running” and “ran” to “run”.

Removing diacritics: Remove accents or other diacritical marks from characters.

# Defining a function for preprocessing
def preprocess(text, remove_digits=True):
    text = re.sub(r'\W+', ' ', text)
    text = re.sub(r'\s+', ' ', text)
    text = re.sub(r"(?<!\w)\d+", "", text)
    text = text.lower()
    nopunc = [char for char in text if char not in string.punctuation]
    nopunc = ''.join(nopunc)
    nopunc = ' '.join([word for word in nopunc.split()
                       if word.lower() not in stopwords.words('english')])
    return nopunc

# Defining a function for lemmatization
def lemmatize(words):
    words = nlp(words)
    lemmas = []
    for word in words:
        lemmas.append(word.lemma_)
    return lemmas

# Converting the list of lemmas back into a string
def listtostring(s):
    str1 = ' '
    return (str1.join(s))

def clean_text(input):
    word = preprocess(input)
    lemmas = lemmatize(word)
    return listtostring(lemmas)

# Creating a feature to store clean texts
stress['clean_text'] = stress['text'].apply(clean_text)
stress.head()

Machine Learning Model Building

Machine learning model building is the process of creating a mathematical representation or model that can learn patterns and make predictions or decisions from data. It involves training a model using a labeled dataset and then using that model to make predictions on new, unseen data.

This starts with selecting or creating relevant features from the available data: feature engineering aims to extract meaningful information from the raw data that can help the model learn patterns effectively.

# Vectorization
from sklearn.feature_extraction.text import TfidfVectorizer
# Model Building
from sklearn.model_selection import GridSearchCV, StratifiedKFold, KFold, train_test_split, cross_val_score, cross_val_predict
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn import preprocessing
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import StackingClassifier, RandomForestClassifier, AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier
# Model Evaluation
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, f1_score, precision_score
from sklearn.pipeline import Pipeline
# Time
from time import time

# Defining target & feature for ML model building
x = stress['clean_text']
y = stress['label']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=1)

Choosing an appropriate machine learning algorithm or model architecture based on the nature of the problem and the characteristics of the data. Different models, such as decision trees, support vector machines, or neural networks, have different strengths and weaknesses.

Training the selected model using the labeled data. This step involves feeding the training data to the model and allowing it to learn the patterns and relationships between the features and the target variable.

# Self-defining function to convert the data into vector form by tf idf #vectorizer and classify and create model by Logistic regression def model_lr_tf(x_train, x_test, y_train, y_test): global acc_lr_tf,f1_lr_tf # Text to vector transformation vector = TfidfVectorizer() x_train = vector.fit_transform(x_train) x_test = vector.transform(x_test) ovr = LogisticRegression() #fitting training data into the model & predicting t0 = time() ovr.fit(x_train, y_train) y_pred = ovr.predict(x_test) # Model Evaluation conf=confusion_matrix(y_test,y_pred) acc_lr_tf=accuracy_score(y_test,y_pred) f1_lr_tf=f1_score(y_test,y_pred,average='weighted') print('Time :',time()-t0) print('Accuracy: ',acc_lr_tf) print(10*'===========') print('Confusion Matrix: n',conf) print(10*'===========') print('Classification Report: n',classification_report(y_test,y_pred)) return y_test,y_pred,acc_lr_tf # Self defining function to convert the data into vector form by tf idf #vectorizer and classify and create model by MultinomialNB def model_nb_tf(x_train, x_test, y_train, y_test): global acc_nb_tf,f1_nb_tf # Text to vector transformation vector = TfidfVectorizer() x_train = vector.fit_transform(x_train) x_test = vector.transform(x_test) ovr = MultinomialNB() #fitting training data into the model & predicting t0 = time() ovr.fit(x_train, y_train) y_pred = ovr.predict(x_test) # Model Evaluation conf=confusion_matrix(y_test,y_pred) acc_nb_tf=accuracy_score(y_test,y_pred) f1_nb_tf=f1_score(y_test,y_pred,average='weighted') print('Time : ',time()-t0) print('Accuracy: ',acc_nb_tf) print(10*'===========') print('Confusion Matrix: n',conf) print(10*'===========') print('Classification Report: n',classification_report(y_test,y_pred)) return y_test,y_pred,acc_nb_tf # Self defining function to convert the data into vector form by tf idf # vectorizer and classify and create model by Decision Tree def model_dt_tf(x_train, x_test, y_train, y_test): global acc_dt_tf,f1_dt_tf # Text to vector transformation vector = TfidfVectorizer() x_train = vector.fit_transform(x_train) x_test = vector.transform(x_test) ovr = DecisionTreeClassifier(random_state=1) #fitting training data into the model & predicting t0 = time() ovr.fit(x_train, y_train) y_pred = ovr.predict(x_test) # Model Evaluation conf=confusion_matrix(y_test,y_pred) acc_dt_tf=accuracy_score(y_test,y_pred) f1_dt_tf=f1_score(y_test,y_pred,average='weighted') print('Time : ',time()-t0) print('Accuracy: ',acc_dt_tf) print(10*'===========') print('Confusion Matrix: n',conf) print(10*'===========') print('Classification Report: n',classification_report(y_test,y_pred)) return y_test,y_pred,acc_dt_tf # Self defining function to convert the data into vector form by tf idf #vectorizer and classify and create model by KNN def model_knn_tf(x_train, x_test, y_train, y_test): global acc_knn_tf,f1_knn_tf # Text to vector transformation vector = TfidfVectorizer() x_train = vector.fit_transform(x_train) x_test = vector.transform(x_test) ovr = KNeighborsClassifier() #fitting training data into the model & predicting t0 = time() ovr.fit(x_train, y_train) y_pred = ovr.predict(x_test) # Model Evaluation conf=confusion_matrix(y_test,y_pred) acc_knn_tf=accuracy_score(y_test,y_pred) f1_knn_tf=f1_score(y_test,y_pred,average='weighted') print('Time : ',time()-t0) print('Accuracy: ',acc_knn_tf) print(10*'===========') print('Confusion Matrix: n',conf) print(10*'===========') print('Classification Report: n',classification_report(y_test,y_pred)) # Self defining function to convert the data into vector form by tf 
idf #vectorizer and classify and create model by Random Forest def model_rf_tf(x_train, x_test, y_train, y_test): global acc_rf_tf,f1_rf_tf # Text to vector transformation vector = TfidfVectorizer() x_train = vector.fit_transform(x_train) x_test = vector.transform(x_test) ovr = RandomForestClassifier(random_state=1) #fitting training data into the model & predicting t0 = time() ovr.fit(x_train, y_train) y_pred = ovr.predict(x_test) # Model Evaluation conf=confusion_matrix(y_test,y_pred) acc_rf_tf=accuracy_score(y_test,y_pred) f1_rf_tf=f1_score(y_test,y_pred,average='weighted') print('Time : ',time()-t0) print('Accuracy: ',acc_rf_tf) print(10*'===========') print('Confusion Matrix: n',conf) print(10*'===========') print('Classification Report: n',classification_report(y_test,y_pred)) # Self defining function to convert the data into vector form by tf idf # vectorizer and classify and create model by Adaptive Boosting def model_ab_tf(x_train, x_test, y_train, y_test): global acc_ab_tf,f1_ab_tf # Text to vector transformation vector = TfidfVectorizer() x_train = vector.fit_transform(x_train) x_test = vector.transform(x_test) ovr = AdaBoostClassifier(random_state=1) #fitting training data into the model & predicting t0 = time() ovr.fit(x_train, y_train) y_pred = ovr.predict(x_test) # Model Evaluation conf=confusion_matrix(y_test,y_pred) acc_ab_tf=accuracy_score(y_test,y_pred) f1_ab_tf=f1_score(y_test,y_pred,average='weighted') print('Time : ',time()-t0) print('Accuracy: ',acc_ab_tf) print(10*'===========') print('Confusion Matrix: n',conf) print(10*'===========') print('Classification Report: n',classification_report(y_test,y_pred)) Model Evaluation

Model evaluation is a crucial step in machine learning to assess the performance and effectiveness of a trained model. It involves measuring how well the model generalizes to unseen data and whether it meets the desired objectives. We evaluate the trained model’s performance on the testing data, calculating evaluation metrics such as accuracy, precision, recall, and F1-score to assess its effectiveness in stress detection. Model evaluation provides insights into the model’s strengths, weaknesses, and suitability for the intended task.
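As a quick, purely illustrative refresher on what these metrics measure (toy labels, not the stress dataset), the sklearn.metrics functions can be checked against a hand calculation:

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1]   # toy ground-truth labels
y_hat  = [1, 0, 0, 1, 0, 1]   # toy predictions (one stressed post missed)
print(accuracy_score(y_true, y_hat))    # 5/6 ≈ 0.83
print(precision_score(y_true, y_hat))   # 3/3 = 1.0
print(recall_score(y_true, y_hat))      # 3/4 = 0.75
print(f1_score(y_true, y_hat))          # 2*1.0*0.75/(1.0+0.75) ≈ 0.857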

# Evaluating Models print('********************Logistic Regression*********************') print('n') model_lr_tf(x_train, x_test, y_train, y_test) print('n') print(30*'==========') print('n') print('********************Multinomial NB*********************') print('n') model_nb_tf(x_train, x_test, y_train, y_test) print('n') print(30*'==========') print('n') print('********************Decision Tree*********************') print('n') model_dt_tf(x_train, x_test, y_train, y_test) print('n') print(30*'==========') print('n') print('********************KNN*********************') print('n') model_knn_tf(x_train, x_test, y_train, y_test) print('n') print(30*'==========') print('n') print('********************Random Forest Bagging*********************') print('n') model_rf_tf(x_train, x_test, y_train, y_test) print('n') print(30*'==========') print('n') print('********************Adaptive Boosting*********************') print('n') model_ab_tf(x_train, x_test, y_train, y_test) print('n') print(30*'==========') print('n') Model Performance Comparison

This is a crucial step in machine learning to identify the best-performing model for a given task. When comparing models, it is important to have a clear objective in mind. Whether it is maximizing accuracy, optimizing for speed, or prioritizing interpretability, the evaluation metrics and techniques should align with the specific objective.

Consistency is key in model performance comparison. Using consistent evaluation metrics across all models ensures a fair and meaningful comparison. It is also important to split the data into training, validation, and test sets consistently across all models. By ensuring that the models evaluate on the same data subsets, researchers enable a fair comparison of their performance.

Considering these above factors, researchers can conduct a comprehensive and fair model performance comparison, which will lead to informed decisions regarding model selection for the specific problem at hand.

# Creating tabular format for better comparison
tbl = pd.DataFrame()
tbl['Model'] = pd.Series(['Logistic Regression','Multinomial NB','Decision Tree','KNN','Random Forest','Adaptive Boosting'])
tbl['Accuracy'] = pd.Series([acc_lr_tf, acc_nb_tf, acc_dt_tf, acc_knn_tf, acc_rf_tf, acc_ab_tf])
tbl['F1_Score'] = pd.Series([f1_lr_tf, f1_nb_tf, f1_dt_tf, f1_knn_tf, f1_rf_tf, f1_ab_tf])
tbl.set_index('Model')

# Best model on the basis of F1 Score
tbl.sort_values('F1_Score', ascending=False)

Cross Validation to Avoid Overfitting

Cross-validation is indeed a valuable technique to help avoid overfitting when training machine learning models. It provides a robust evaluation of the model’s performance by using multiple subsets of the data for training and testing. It helps assess the model’s generalization capability by estimating its performance on unseen data.

# Using cross validation method to avoid overfitting import statistics as st vector = TfidfVectorizer() x_train_v = vector.fit_transform(x_train) x_test_v = vector.transform(x_test) # Model building lr =LogisticRegression() mnb=MultinomialNB() dct=DecisionTreeClassifier(random_state=1) knn=KNeighborsClassifier() rf=RandomForestClassifier(random_state=1) ab=AdaBoostClassifier(random_state=1) m =[lr,mnb,dct,knn,rf,ab] model_name=['Logistic R','MultiNB','DecTRee','KNN','R forest','Ada Boost'] results, mean_results, p, f1_test=list(),list(),list(),list() #Model fitting,cross-validating and evaluating performance def algor(model): print('n',i) pipe=Pipeline([('model',model)]) pipe.fit(x_train_v,y_train) cv=StratifiedKFold(n_splits=5) n_scores=cross_val_score(pipe,x_train_v,y_train,scoring='f1_weighted', cv=cv,n_jobs=-1,error_score='raise') results.append(n_scores) mean_results.append(st.mean(n_scores)) print('f1-Score(train): mean= (%.3f), min=(%.3f)) ,max= (%.3f), stdev= (%.3f)'%(st.mean(n_scores), min(n_scores), max(n_scores),np.std(n_scores))) y_pred=cross_val_predict(model,x_train_v,y_train,cv=cv) p.append(y_pred) f1=f1_score(y_train,y_pred, average = 'weighted') f1_test.append(f1) print('f1-Score(test): %.4f'%(f1)) for i in m: algor(i) # Model comparison By Visualizing fig=plt.subplots(figsize=(20,15)) plt.title('MODEL EVALUATION BY CROSS VALIDATION METHOD') plt.xlabel('MODELS') plt.ylabel('F1 Score') plt.boxplot(results,labels=model_name,showmeans=True) plt.show() As F1 scores of the models are coming quite similar in both methods. So now we are applying the Leave One Out method to build the best-performed model. x=stress['clean_text'] y=stress['label'] x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.3,random_state=1) vector = TfidfVectorizer() x_train = vector.fit_transform(x_train) x_test = vector.transform(x_test) model_lr_tf=LogisticRegression() model_lr_tf.fit(x_train,y_train) y_pred=model_lr_tf.predict(x_test) # Model Evaluation conf=confusion_matrix(y_test,y_pred) acc_lr=accuracy_score(y_test,y_pred) f1_lr=f1_score(y_test,y_pred,average='weighted') print('Accuracy: ',acc_lr) print('F1 Score: ',f1_lr) print(10*'===========') print('Confusion Matrix: n',conf) print(10*'===========') print('Classification Report: n',classification_report(y_test,y_pred)) Word Clouds of Stressed & Non-stressed Words

The dataset contains text messages or documents that are labeled as either stressed or non-stressed. The code loops through the two labels to create a word cloud for each label using the WordCloud library and display the word cloud visualization. Each word cloud represents the most commonly used words in the respective category, with larger words indicating higher frequency. The choice of the color map (‘winter’, ‘autumn’, ‘magma’, ‘Viridis’, ‘plasma’) determines the color scheme of the word clouds. The resulting visualizations provide a concise representation of the most frequent words associated with stressed and non-stressed messages or documents.

Here are word clouds representing stressed and non-stressed words commonly associated with stress detection:

for label, cmap in zip([0,1], ['winter', 'autumn', 'magma', 'viridis', 'plasma']): text = stress.query('label == @label')['text'].str.cat(sep=' ') plt.figure(figsize=(12, 9)) wc = WordCloud(width=1000, height=600, background_color="#f8f8f8", colormap=cmap) wc.generate_from_text(text) plt.imshow(wc) plt.axis("off") plt.title(f"Words Commonly Used in ${label}$ Messages", size=20) plt.show() Prediction

The new input data is preprocessed and features are extracted to match the model’s expectations. The predict function is then used to generate predictions based on the extracted features. Finally, the predictions are printed or utilized as required for further analysis or decision-making.

data=["""I don't have the ability to cope with it anymore. I'm trying, but a lot of things are triggering me, and I'm shutting down at work, just finding the place I feel safest, and staying there for an hour or two until I feel like I can do something again. I'm tired of watching my back, tired of traveling to places I don't feel safe, tired of reliving that moment, tired of being triggered, tired of the stress, tired of anxiety and knots in my stomach, tired of irrational thought when triggered, tired of irrational paranoia. I'm exhausted and need a break, but know it won't be enough until I journey the long road through therapy. I'm not suicidal at all, just wishing this pain and misery would end, to have my life back again."""] data=vector.transform(data) model_lr_tf.predict(data) data=["""In case this is the first time you're reading this post... We are looking for people who are willing to complete some online questionnaires about employment and well-being which we hope will help us to improve services for assisting people with mental health difficulties to obtain and retain employment. We are developing an employment questionnaire for people with personality disorders; however we are looking for people from all backgrounds to complete it. That means you do not need to have a diagnosis of personality disorder – you just need to have an interest in completing the online questionnaires. The questionnaires will only take about 10 minutes to complete online. For your participation, we’ll donate £1 on your behalf to a mental health charity (Young Minds: Child & Adolescent Mental Health, Mental Health Foundation, or Rethink)"""] data=vector.transform(data) model_lr_tf.predict(data) Conclusion

The application of machine learning techniques to predicting stress levels provides personalized insights for mental well-being. By analyzing a variety of factors, such as numerical measurements (e.g., blood pressure, heart rate) and categorical characteristics (e.g., gender, occupation), machine learning models can learn patterns and make predictions about an individual's stress level. With the ability to accurately detect and monitor stress levels, machine learning contributes to the development of proactive strategies and interventions to manage and enhance mental well-being.

We explored the insights from using machine learning in stress prediction and its potential to revolutionize our approach to addressing this critical issue.

Accurate Predictions: Machine learning algorithms analyze vast amounts of historical data to accurately predict stress occurrences, providing valuable insights and forecasts.

Early Detection: Machine learning can detect warning signs early on, allowing for proactive measures and timely support in vulnerable areas.

Enhanced Planning and Resource Allocation: Machine learning enables forecasting of stress hotspots and intensities, optimizing the allocation of resources such as emergency services and medical facilities.

Improved Public Safety: Timely alerts and warnings issued through machine learning predictions empower individuals to take necessary precautions, reducing the impact of stress and enhancing public safety.

In conclusion, this stress prediction analysis provides valuable insights into stress levels and their prediction using machine learning. Use the findings to develop tools and interventions for stress management, promoting overall well-being and improved quality of life.


Related

How To Use Detection Mode In Magnifier On iPhone And iPad

Things to know before using Detection Mode in iOS 16 and iPadOS 16

While the Magnifier app on iPhone and iPad may be used by many simply to magnify small text, you may be surprised by how much more it can do. Since covering all of its features would take ages (it has a lot of them), I will focus on helping you use Detection Mode in Magnifier on your iPhone.

This feature mainly caters to visually impaired people. So if you know someone who could benefit from it, read along and help them make the most of this great feature.

Apple devices that support Detection Mode in Magnifier app

While people detection has been around for quite some time on iPhones, Apple introduced Door Detection with iOS 16 and iPadOS 16. So as you might have guessed, to use this feature, you must ensure that your iPhone or iPad is updated to the latest iOS or iPadOS version.

iPhone 12 Pro and iPhone 12 Pro Max

iPhone 13 Pro and iPhone 13 Pro Max

iPhone 14 Pro and iPhone 14 Pro Max

iPad Pro (2023)

iPad Pro (2023)

If your iPhone or iPad meets these criteria, let’s take a look at what to do next.

How to enable Detection Mode in iOS 16 and iPadOS 16

While in most cases, the option is enabled by default, it is good to double-check. Here’s how you can turn on Detection Mode on your iPhone and iPad:

Open the Magnifier app.

Tap the gear icon.

Select Settings.

Here, tap the plus icon next to Detection Mode.

If you see a minus icon (-), it’s already enabled.

Tap Done.

Now, you will see the Detection Mode icon while opening the app itself.

Use Door Detection on iPhone and iPad

Open the Magnifier app.

Tap the Detection Mode icon.

Select the Door Detection icon.

You will see a confirmation message at the top. To detect a door, move close to one, and your iPhone will announce the door, along with its distance and the type of door.

Customize Door Detection in Magnifier app

Tap the gear icon.

Select Settings.

Scroll down and tap Door Detection.

On the Door Detection page, you will get the following customization options:

Units: Meters and Feet

Sound Pitch Distance: Here, you can customize the sound feedback from your iPhone when it detects a door at a set distance.

Feedback: Toggle on any of the following options according to your choice.

Sounds

Speech

Haptics

Colour: Customize the color of the door outline.

Back Tap: Once toggled on, you can use the double-tap feature to hear more information about the detected doors.

Door Attributes: Enable this to get more information about the detected doors.

Door Decorations: Provides information about door decorations.

How to use People Detection in iOS 16 and iPadOS 16

Open Magnifier.

Tap the Detection Mode icon.

Select the People Detection icon.

You will see a confirmation message at the top, and when you point your iPhone toward a person, it will tell you about the person and how far away they are.

Customize People Detection in Magnifier app

Tap the gear icon.

Select Settings.

Scroll down and tap People Detection.

These are the customization options that you will get for People Detection:

Units

Sound Pitch Distance

Feedback

Fix Detection Mode not working on iPhone and iPad

Despite enabling this feature, if you are unable to use Detection Mode on your iPhone or iPad, here are some things to check.

Make sure you have enabled Detection Mode.

Update your iPhone or iPad to the latest iOS version.

Check if your device is equipped with a LiDAR sensor.

Restart your iPhone or iPad. If that doesn’t work, try a force restart.

While using the app, try to move away or a bit closer to see if that triggers the feature.

That’s it!

Author Profile

Anoop

Anoop loves to find solutions for all your doubts on Tech. When he’s not on his quest, you can find him on Twitter talking about what’s on his mind.

How To Type Z With A Line Through It Symbol In Word

In today’s article, you’ll learn how to use some keyboard shortcuts and other methods to type or insert the Z with a line through it (Ƶ) symbol in Word or Excel on Windows.

Just before we begin, I’d like to tell you that you can also use the button below to copy and paste the Z with Stroke sign into your work for free.

However, if you just want to type this symbol on your keyboard, the actionable steps below will show you how.

To type the Z with a line through it in Word, simply press down the Alt key and type 437 using the numeric keypad, then let go of the Alt key. This alt code (437) works only in Microsoft Word.

The table below contains all the shortcuts you need to type this symbol on the keyboard in Microsoft Word.

Symbol Name: Z with Stroke
Symbol Text: Ƶ
Alt Code: 437
Shortcut in Word (1): Alt + 437
Shortcut in Word (2): 01B5, Alt+X
Shortcut for Mac: Not Available

To use the shortcut in MS Word, first type the code 01B5, select it and press Alt+X on your keyboard.
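Incidentally, 437 is simply the decimal form of the hexadecimal code point 01B5, so both shortcuts point at the same Unicode character, U+01B5 (Latin Capital Letter Z with Stroke). If you ever want to confirm this outside Word, a short, purely illustrative Python check looks like this:

import unicodedata

print(chr(0x01B5))            # Ƶ
print(unicodedata.name("Ƶ"))  # LATIN CAPITAL LETTER Z WITH STROKE
print(ord("Ƶ"))               # 437 (i.e., hex 0x01B5)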

The quick guide above provides some useful shortcuts and alt codes for typing this sign on Windows. However, below are some other methods you can employ to insert this symbol into your work, such as a Word or Excel document.

Microsoft Office provides several methods for typing Z with a line through it or inserting symbols that do not have dedicated keys on the keyboard.

In this section, I will walk you through five different methods you can use to type or insert this and any other symbol on your PC, such as in MS Word.

Without any further ado, let’s get started.

The Z with a line through it alt code is 437.

Even though this Symbol (Ƶ) does not have a dedicated key on the keyboard, you can still type it on the keyboard with the Alt code method. To do this, press and hold the Alt key whilst pressing the Z with a line through it Alt code (437) using the numeric keypad.

This method works in MS Word and Windows only. And your keyboard must also have a numeric keypad.

Below is a break-down of the steps you can take to type the Z with Stroke Sign on your Windows PC:

Place your insertion pointer where you need the Symbol.

Press and hold one of the Alt keys on your keyboard.

Whilst holding the Alt key, press the Z with a line through it alt code (437). You must use the numeric keypad to type the alt code. If you are using a laptop without a numeric keypad, this method may not work for you. On some laptops, there’s a hidden numeric keypad which you can enable by pressing Fn+NmLk on the keyboard.

Release the Alt key after typing the Z with Stroke Sign Alt code to insert the Symbol into your document.

This is how you may type this symbol in Word using the Alt Code method.

To type this symbol in Word, simply type 01B5 and then press Alt+X to obtain the Ƶ symbol. Alternatively, use the Alt Code shortcut method by pressing down the [Alt] key whilst typing the symbol alt code which is 437.

Note: You must use the numeric keypad to type the alt code. Also, ensure that your Num Lock key is turned on.

Below is a breakdown of the Z with Stroke Symbol shortcut for Windows:

Place the insertion pointer at the desired location.

Press and hold down the Alt key

While pressing down the Alt key, type 437 using the numeric keypad to insert the symbol.

These are the steps you may use to type this Symbol in Word or Excel.

Another easy way to get the Z with a line through it on any PC is to use my favorite method: copy and paste.

All you have to do is copy the symbol from somewhere like a web page, or from the Character Map for Windows users, head over to where you need the symbol (say in Word or Excel), and hit Ctrl+V to paste.

Below is the symbol for you to copy and paste into your Word document. Just select it and press Ctrl+C to copy, switch over to Microsoft Word, place your insertion pointer at the desired location, and press Ctrl+V to paste.

Ƶ

Alternatively, just use the copy button at the beginning of this post.

For Windows users, follow the instructions below to copy and paste the Z with a line through it using the Character Map dialog box.

Switch to your Microsoft Word or Excel document, place the insertion pointer at the desired location, and press Ctrl+V to paste.

This is how you may use the Character Map dialog to copy and paste any symbol on Windows PC.

Follow the steps below to insert the Z with a line through it symbol in Word using the Insert Symbol dialog box.

Close the dialog.

The symbol will then be inserted exactly where you placed the insertion pointer.

These are the steps you may use to insert this and any other symbol in Word or Excel.

As you can see, there are several different methods you can use to type the Z with Stroke Sign in Microsoft Word.

Using the keyboard shortcuts is the fastest option for this task.

Thank you very much for reading this blog.
