Anomaly Detection On Google Stock Data 2014


Introduction

Welcome to the fascinating world of stock market anomaly detection! In this project, we’ll dive into the historical data of Google’s stock from 2014 to 2023 and use anomaly detection techniques to uncover hidden patterns and gain insights into the stock market. By identifying outliers and other anomalies, we aim to understand stock market trends better and potentially discover new investment opportunities. With the power of Python and the Scikit-learn library at our fingertips, we’re ready to embark on a data science journey that could change how we view the stock market. So, fasten your seatbelts and get ready to discover the unknown!

Learning Objectives:

In this article, we will:

Explore the data and identify potential anomalies.

Create visualizations to understand the data and its anomalies better.

Construct and train a model to detect anomalous data points.

Analyze and interpret our results to draw meaningful conclusions about the stock market.

This article was published as a part of the Data Science Blogathon.

Understanding the Data and Problem Statement

In this project-based blog, we will explore anomaly detection in Google stock data from 2014 to 2023. The dataset used in this project is obtained from Kaggle, and you can download it here. It consists of monthly stock price data for Google, also known as Alphabet Inc. (GOOGL), and contains 106 rows and 7 columns. The features include the opening, closing, highest, and lowest prices and the volume of shares traded, along with the starting date of each month.

Problem statement

This project aims to analyze the Google stock data from 2014 to 2023 and use anomaly detection techniques to uncover hidden patterns and outliers in the data. We will use the Scikit-learn library in Python to construct and train a model to detect anomalous data points within the dataset. Finally, we will analyze and interpret our results to draw meaningful conclusions about the stock market.

Data Preprocessing

Missing values

Python Code:
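Before treating anything, we load the dataset and check each column for missing values. Here is a minimal sketch of that step, assuming the Kaggle file has been saved locally as google_stock_data.csv (a hypothetical filename):

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the Kaggle export (hypothetical filename) and take a first look
data = pd.read_csv('google_stock_data.csv')
print(data.shape)          # expecting (106, 7)
print(data.head())

# Count missing values per column
print(data.isnull().sum())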



Finding data points that have a 0.0% change from the previous month’s value:

data[data['Change %']==0.0]

Two data points, 100 and 105, have a 0.0% change.

Changing the ‘Month Starting’ column to a date datatype:

data['Month Starting'] = pd.to_datetime(data['Month Starting'], errors='coerce').dt.date

After this conversion, we encounter three unexpected missing values, since entries that could not be parsed were coerced to NaT. Let’s address these missing values.

# Replacing the missing values after cross-verifying with the source data
data.loc[31, 'Month Starting'] = pd.to_datetime('2024-05-01')
data.loc[43, 'Month Starting'] = pd.to_datetime('2024-05-01')
data.loc[55, 'Month Starting'] = pd.to_datetime('2024-05-01')

The data is now clean and ready to be analyzed.

Exploratory Data Analysis

Exploratory Data Analysis (EDA) is an important first step in analyzing a dataset; it involves examining and summarizing the main characteristics of the data. Data visualization is one of the most powerful and widely used tools in EDA: it lets us visually explore the patterns and trends in the data, and it can reveal relationships, outliers, and potential errors.

Change in the stock price over the years:

plt.figure(figsize=(25,5))
plt.plot(data['Month Starting'], data['Open'], label='Open')
plt.plot(data['Month Starting'], data['Close'], label='Close')
plt.xlabel('Year')
plt.ylabel('Close Price')
plt.legend()
plt.title('Change in the stock price of Google over the years')

The stock price has risen over the years, with a pronounced peak toward the end of the period.

# Calculating the returns (percentage change in the closing price)
data['Returns'] = data['Close'].pct_change()

# Calculating the rolling average of the returns
data['Rolling Average'] = data['Returns'].rolling(window=30).mean()

plt.figure(figsize=(10,5))
# Line plot with 'Month Starting' on the x-axis and 'Rolling Average' on the y-axis
sns.lineplot(x='Month Starting', y='Rolling Average', data=data)

The plot above shows that the rolling average of the returns declines toward the end of the period.

Correlation is a statistical measure that indicates the degree to which two or more variables are related. It is a useful tool in data analysis, as it can help to identify patterns and relationships between variables and to understand the extent to which changes in one variable are associated with changes in another. To find the correlation between variables in the data, we can use the built-in function corr(). This gives us a correlation matrix with values ranging from -1.0 to 1.0. The closer a value is to 1.0, the stronger the positive correlation between the two variables; conversely, the closer a value is to -1.0, the stronger the negative correlation. The heatmap visually represents the correlation intensity between the variables, with darker colors indicating stronger correlations and lighter colors indicating weaker ones. This can be a helpful way to quickly identify relationships between variables and guide further analysis.

corr = data.corr()
plt.figure(figsize=(10,10))
sns.heatmap(corr, annot=True, cmap='coolwarm')

Scaling the returns using StandardScaler

To ensure that the data is normalized to have zero mean and unit variance, we use the StandardScaler from the Scikit-learn library. We first import the StandardScaler class and then create an instance of the class. We then fit the scaler to the Returns column of our dataset using the fit_transform method. This scales our data to have zero mean and unit variance, which is necessary for some machine learning algorithms to function properly.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
data['Returns'] = scaler.fit_transform(data['Returns'].values.reshape(-1,1))
data.head()

Handling Unexpected Missing Values

data['Returns'] = data['Returns'].fillna(data['Returns'].mean())
data['Rolling Average'] = data['Rolling Average'].fillna(data['Rolling Average'].mean())

Model Development

Now that the data has been preprocessed and analyzed, we are ready to develop a model for anomaly detection. We will use the Scikit-learn library in Python to construct and train a model to detect anomalous data points within the dataset.

We will use the Isolation Forest algorithm to detect anomalies. Isolation Forest is an unsupervised machine learning algorithm that isolates anomalies by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of the selected feature. This process is repeated until the anomaly is isolated.

We will use the Scikit-learn library to construct and train our Isolation Forest model. The following code snippet shows how to construct and train the model.

from sklearn.ensemble import IsolationForest

model = IsolationForest(contamination=0.05)
model.fit(data[['Returns']])

# Predicting anomalies
data['Anomaly'] = model.predict(data[['Returns']])
data['Anomaly'] = data['Anomaly'].map({1: 0, -1: 1})

# Plotting the results
plt.figure(figsize=(13,5))
plt.plot(data.index, data['Returns'], label='Returns')
plt.scatter(data[data['Anomaly'] == 1].index, data[data['Anomaly'] == 1]['Returns'], color='red')
plt.legend(['Returns', 'Anomaly'])
plt.show()

Conclusion

This project-based blog explored anomaly detection in Google stock data from 2014 to 2023. We used the Scikit-learn library in Python to construct and train an Isolation Forest model to detect anomalous data points within the dataset.

Our model was able to uncover hidden patterns and outliers in the data, and we were able to draw meaningful conclusions about the stock market. We found that the stock price rose over the period, that the rolling average of returns declined toward the end of it, and that the Open price correlates more strongly with the Close price than any other feature.

Overall, this project was a great success and has opened up new possibilities for stock market analysis and anomaly detection.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

Related


Stock Photography Impact On Seo

In a recent tweet someone asked Google’s John Mueller if using stock images affects rankings. The person related that a friend has the opinion that using non-original content such as stock photography can indeed affect rankings.

Mueller answered:

“It doesn’t matter for web-search directly.”

Some might find that answer a little vague. But it accurately communicates that stock images don’t matter for regular web search.

The search results are full of top ranked sites that use stock images. Many of Search Engine Journal’s featured images are stock photography and they rank perfectly fine.

There’s a longstanding SEO idea that non-original content is a negative thing that can affect rankings. Obviously there is some truth to non-original content being unable to rank.

But there’s a certain point where that truth can get stretched and used in a different context it’s not meant to apply to.

Stock photography is non-original content. But if Google penalized pages for using it, half the Internet would never rank.

Mueller’s answer continued:

“For image search, if it’s the same image as used in many places, it’ll be harder. (There’s also the potential impact on users, after search happens, eg: does it affect conversions if your team photo is obvious stock photography?)”

How to Use Stock Photography for SEO

There are probably many opinions on the best use of stock photography.

It’s probably a good idea to first stop and consider how it is appropriate for your situation.

Does the Image Accurately Represent the Topic?

Sometimes images are used metaphorically and that’s not always the best choice.

For example, it is arguably a lost opportunity to use images of race cars in an article about the “race to succeed.”

That’s an instance of using an image that takes the word “race” in a context that’s different from the topic.

It’s not necessarily a bad choice but there could be a better choice.

A better choice could be an aspirational image, one that shows the success that can result from using the product.

That can be a successful holiday meal, a person graduating from college, whatever the outcome from the use of the product is can be a conversion-oriented image.

Perhaps the important consideration is that the audience can see themselves in the image or can relate to it.

That’s a strategic use of an image to help a page convert a visitor into a consumer or whatever the goal of the webpage is.

Related: 41 Best Stock Photo Sites to Find High-Quality FREE Images

Would the Image Work Well In a Featured Snippet?

This is a good test of how relevant an image is to the topic. Studying the images used in featured snippets is a good way to improve your understanding of what it means for an image to be relevant to the topic of a web page.

Will a strategic use of a stock image help it rank in a featured snippet? That’s debatable.

I find that images with close ties to the topic, that can communicate an answer tend to work well for me. That means using original images or updating and improving on stock images.

Does the Image Help Conversions?

I’m consulting for an organization that was using stock images showing a mix of people who typically use their product.

It turns out that most of the clients are a specific gender. And the people making purchases are largely from a home office type situation.

My suggestion was to increase the use of images featuring that gender and to use images with settings that represent the typical users of their product. That way the potential consumer can see themselves in the picture and understand how much better their lives would be if they made the purchase.

Stock Images and SEO

Using stock photography won’t ruin your SEO. However the strategic use of stock images can help a page be more relevant for the topic that it’s about.

Read the full Twitter discussion here:

It doesn’t matter for web-search directly. For image search, if it’s the same image as used in many places, it’ll be harder. (There’s also the potential impact on users, after search happens, eg: does it affect conversions if your team photo is obvious stock photography?)

— 🦇 johnmu: cats are not people 🦇 (@JohnMu) June 27, 2023

How To Use Detection Mode In Magnifier On Iphone And Ipad

Things to know before using Detection Mode in iOS 16 and iPadOS 16

While the Magnifier app on iPhone and iPad may be used by many to magnify small texts, you may be surprised to know how much more it can do. Since looking at all of its features will take an eternity (not quite literally, but it has many features), I will help you use Detection Mode in Magnifier on your iPhone.

This feature is mainly catered to visually impaired people. So if you know someone who’s going to benefit from this feature, read along and help them make use of this great feature.

Apple devices that support Detection Mode in Magnifier app

While people detection has been around for quite some time on iPhones, Apple introduced Door Detection with iOS 16 and iPadOS 16. So as you might have guessed, to use this feature, you must ensure that your iPhone or iPad is updated to the latest iOS or iPadOS version.

iPhone 12 Pro and iPhone 12 Pro Max

iPhone 13 Pro and iPhone 13 Pro Max

iPhone 14 Pro and iPhone 14 Pro Max

iPad Pro 11-inch (2nd generation and later)

iPad Pro 12.9-inch (4th generation and later)

If your device meets these criteria, let’s take a look at what to do next.

How to enable Detection Mode in iOS 16 and iPadOS 16

While in most cases, the option is enabled by default, it is good to double-check. Here’s how you can turn on Detection Mode on your iPhone and iPad:

Open the Magnifier app.

Tap the gear icon.

Select Settings.

Here, tap the plus icon next to Detection Mode.

If you see a minus icon (-), it’s already enabled.

Tap Done.

Now, you will see the Detection Mode icon while opening the app itself.

Use Door Detection on iPhone and iPad

Open the Magnifier app.

Tap the Detection Mode icon.

Select the Door Detection icon.

You will see a confirmation message at the top. To detect doors, move close to one, and your iPhone will announce the door, along with its distance and type.

Customize Door Detection in Magnifier app

Tap the gear icon.

Select Settings.

Scroll down and tap Door Detection.

On the Door Detection page, you will get the following customization options:

Units: Meters and Feet

Sound Pitch Distance: Here, you can customize the sound feedback from your iPhone when it detects a door at a set distance.

Feedback: Toggle on any of the following options according to your choice.

Sounds

Speech

Haptics

Colour: Customize the color of the door outline.

Back Tap: Once toggled on, you can use the double-tap feature to hear more information about the detected doors.

Door Attributes: Enable this to get more information about the detected doors.

Door Decorations: Provides information about door decorations.

How to use People Detection in iOS 16 and iPadOS 16

Open Magnifier.

Tap the Detection Mode icon.

Select the People Detection icon.

You can see the confirmation message on top, and when you move your iPhone close to a person, it will inform you about the person and how far they are.

Customize People Detection in Magnifier app

Tap the gear icon.

Select Settings.

Scroll down and tap People Detection.

These are the customization options that you will get for People Detection:

Units

Sound Pitch Distance

Feedback

Fix Detection Mode not working on iPhone and iPad

Despite enabling this feature, if you are unable to use Detection Mode on your iPhone and iPad, here are some tips to check.

Make sure you have enabled Detection Mode.

Update your iPhone or iPad to the latest iOS version.

Check if your device is equipped with a LiDAR sensor.

Restart your iPhone or iPad. If that also doesn’t work, try force restart.

While using the app, try to move away or a bit closer to see if that triggers the feature.

That’s it!

Author Profile

Anoop

Anoop loves to find solutions for all your doubts on Tech. When he’s not on his quest, you can find him on Twitter talking about what’s in his mind.

Outliers Detection Using IQR, Z-Score, LOF and DBSCAN

This article was published as a part of the Data Science Blogathon.

Introduction

How do you detect outliers in your data? Which methods work well when the data density is not uniform across the spread? How do you identify outliers in the first place?

Guys, this article will help you get many such questions answered, along with practical applications, whether you are cleaning data before conducting EDA, passing data to a machine learning model, or running a statistical test.

What are Inliers and Outliers?

Outliers are values that seem excessively different from most of the rest of the values in the given dataset. Outliers can arise from new inventions (true outliers), the development of new patterns or phenomena, experimental errors, rarely occurring incidents, anomalies, incorrectly entered data due to typographical mistakes, failures of data recording systems or components, and so on. However, not all outliers are bad; some reveal new information. Inliers are all the data points that are part of the distribution except the outliers.

Outlier’s Identification

Collective Outliers: When a group of data points deviates from the distribution, it is called a collective outlier. Interpreting their relevance is entirely subjective to the specific domain. Collective outliers can also signal the formation of new phenomena or developments.

Contextual Outliers: These are values that are anomalous only in a specific context (e.g., the usual winter temperature in Leh of around 9°C would be the rarest of phenomena in Ahmedabad, Gujarat; punctuation symbols when attempting text analysis; background noise signals when doing speech recognition).

Fig: 1 (Point/Global or collective Outliers)

For ease of understanding, I have considered a real case study on steel scrap sales over three years.

Real Case Example of Outliers

Consider a real-case scenario: the Steel Sheet Scrap Rate (Rs/Kg) for scrap sold across India over a three-year period has been captured to understand the statistics and, eventually, predict future prices. Before doing that, as part of the data-cleaning process, we want to understand the presence of outliers and their weightage.

Importing important libraries to load the dataset and conduct further analysis:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as st
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

df = pd.read_excel("scrap_data.xlsx", skiprows=2)
print('shape of data:', df.shape)
df.head()

To understand the trend, I have plotted line plots of the two main independent variables (‘Scrap Rate’ and ‘Scrap Weight’) with reference to their date of sale.

plt.figure(figsize=(15,5))

plt.subplot(1,2,1)
sns.lineplot(x=df['Job Start Date'], y=df['Rate in Rs./Kg.'], color='r')
plt.title("Steel Scrap Rate (Rs/Kg)", fontsize=20)
plt.xlabel('Date')

plt.subplot(1,2,2)
sns.lineplot(x=df['Job Start Date'], y=df['Scrape Sale Qty.'], color='b')
plt.title("Steel Scrap Weight (Rs/Kg)", fontsize=20)
plt.xlabel('Date')

Looking at the trend in the Scrap Rate feature, we can see sudden spikes where rates cross 120 Rs/kg, indicating anomalies, since the scrap rate should increase or decrease gradually rather than jump abruptly. In the case of Scrap Weight, however, depending on the size of a construction project, the scrap generated at the close of a project may legitimately be high or low in volume at any time.

Let’s try applying the different methods of detecting and treating outliers:

Inter Quartile Range (IQR)

IQR measures variability by dividing the sorted dataset into four equal parts. First, the entire data is sorted in ascending order; the three cut points that split it into quarters are called Q1, Q2, and Q3, and the interquartile range is IQR = Q3 - Q1. The IQR method is best suited when the data forms a skewed distribution.

The first Quartile (Q1) divides the smallest 25% of the values from the other 75% that are larger.

The third quartile (Q3) divides the smallest 75% of the values from the largest 25%.

Lower Bound Limit = Q1 – 1.5 x IQR

Upper Bound Limit = Q3 + 1.5 x IQR 

So any values greater than the Upper Bound Limit (Q3 + 1.5*IQR) or less than the Lower Bound Limit (Q1 - 1.5*IQR) can be considered outliers in the given dataset.
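Before plotting the boxplots, here is a minimal sketch of that bound calculation applied to the scrap-rate column (the same logic is wrapped into a reusable function later in this section):

# Quartiles and IQR for the scrap rate column
q1 = df['Rate in Rs./Kg.'].quantile(0.25)
q3 = df['Rate in Rs./Kg.'].quantile(0.75)
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Rows outside the fences are candidate outliers
outliers = df[(df['Rate in Rs./Kg.'] < lower_fence) | (df['Rate in Rs./Kg.'] > upper_fence)]
print('Candidate outliers:', outliers.shape[0])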

Let’s plot Boxplot to know the presence of outliers;

plt.figure(figsize=(15,5))

plt.subplot(1,2,1)
sns.boxplot(df['Scrape Sale Qty.'])
plt.xticks(fontsize=12)
plt.xlabel('Steel-Scrap Weight (in Kgs)')
plt.legend(title="Steel Scrap Weight", fontsize=10, title_fontsize=15)

plt.subplot(1,2,2)
sns.boxplot(df['Rate in Rs./Kg.'])
plt.xlabel('Steel Scrap Rate Rs/kg')
plt.xticks(fontsize=12)
plt.legend(title="Steel Scrap Rate", fontsize=10, title_fontsize=15)

To make the calculation faster, I have created a function to derive the Inter-Quartile Range (IQR), Lower Fence, and Upper Fence, with conditions either to drop the outliers or to cap them at the upper or lower fence values, respectively.

def identifying_treating_outliers(df, col, remove_or_fill_with_quartile):
    q1 = df[col].quantile(0.25)
    q3 = df[col].quantile(0.75)
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    print('Lower Fence:', lower_fence)
    print('Upper Fence:', upper_fence)
    print('Total number of outliers:',
          df[(df[col] < lower_fence) | (df[col] > upper_fence)].shape[0])
    if remove_or_fill_with_quartile == "drop":
        df.drop(df.loc[(df[col] < lower_fence) | (df[col] > upper_fence)].index, inplace=True)
    elif remove_or_fill_with_quartile == "fill":
        df[col] = np.where(df[col] < lower_fence, lower_fence, df[col])
        df[col] = np.where(df[col] > upper_fence, upper_fence, df[col])

Applying the Function to the Scrap Rate and Scrap Weight column:

identifying_treating_outliers(df, 'Scrape Sale Qty.', 'drop')
identifying_treating_outliers(df, 'Rate in Rs./Kg.', 'drop')

DF shape before Application of Function : (1001, 5)

DF Shape after Application of Function : (925, 5)

Plotting boxplots to check the status of outliers after applying the ‘identifying_treating_outliers’ function:

plt.figure(figsize=(15,5))

plt.subplot(1,2,1)
sns.boxplot(df['Scrape Sale Qty.'])
plt.xticks(fontsize=12)
plt.xlabel('Steel-Scrap Weight (in Kgs)')
plt.legend(title="Steel Scrap Weight", fontsize=10, title_fontsize=15)

plt.subplot(1,2,2)
sns.boxplot(df['Rate in Rs./Kg.'])
plt.xlabel('Steel Scrap Rate Rs/kg')
plt.xticks(fontsize=12)
plt.legend(title="Steel Scrap Rate", fontsize=10, title_fontsize=15)

Z-score Method

The Z-score of a value is the difference between that value and the mean, divided by the standard deviation. Z-scores help identify outliers: a data point is flagged if its Z-score is either less than -3 or greater than +3. The Z-score can be expressed mathematically as:

Z = (X - μ) / σ, where X = the particular value, μ = the mean, and σ = the standard deviation.

The figure below shows the transformation of data from a normal distribution to a standard normal distribution using the Z-score.

In our dataset, we will flag as outliers any points with a Z-score greater than +3 or less than -3. Just a few lines of code compute the Z-scores, and we can see the difference using distribution plots (before and after).

# Applying the Z-score to the Scrap Rate column; filtered dataframe is dfn
zr = st.zscore(df['Rate in Rs./Kg.'])
dfn = df[(zr < 3) & (zr > -3)]

# Applying the Z-score to the Scrap Weight column; filtered dataframe is dfnf
zw = st.zscore(dfn['Scrape Sale Qty.'])
dfnf = dfn[(zw < 3) & (zw > -3)]

plt.figure(figsize=(12,5))
plt.subplot(1,2,1)
sns.distplot(df['Rate in Rs./Kg.'])
plt.title('Z Score Plot Before Removing Outlier', fontsize=15)
plt.subplot(1,2,2)
sns.distplot(st.zscore(dfn['Rate in Rs./Kg.']))
plt.title('Z Score Plot After Removing Outlier', fontsize=15)

Our data forms a positively skewed distribution (skewness ≈ 0.874), so it cannot be considered approximately normally distributed. A significant improvement can be seen when comparing the plots before and after applying the Z-score filter.

print('before df shape', df.shape)
print('After df shape for observations dropped in Scrap Rate', dfn.shape)
print('After df shape for observations dropped in Scrap Weight', dfnf.shape)

Using the Z-score method, we have dropped 21 data points from the Scrap Rate and Scrap Weight columns (3 from Scrap Rate and 18 from Scrap Weight) whose Z-scores fell below -3 or above +3.

Local Outlier Factor (LOF)

Local Outlier Factor is an unsupervised machine learning technique that detects outliers based on the density of a data point’s closest neighborhood, and it works well when the spread (density) of the dataset is not uniform. LOF basically considers the K-distance (the distance between points) and the K-neighbors (the set of points that lie within the circle of radius K-distance). Refer to the detailed documentation in the SK-Learn library.

LOF takes two major parameters into consideration: (1) n_neighbors, the number of neighbors to use, which has a default value of 20; and (2) contamination, the proportion of outliers in the given dataset, which can be set to ‘auto’ or to a small float value (e.g., 0.02, 0.005).

Importing important libraries and defining model

from sklearn.neighbors import LocalOutlierFactor

d2 = df.values  # converting the df into a numpy array
lof = LocalOutlierFactor(n_neighbors=20, contamination='auto')
good = lof.fit_predict(d2) == 1

plt.figure(figsize=(10,5))
plt.scatter(d2[good, 1], d2[good, 0], s=2, label="Inliers", color="#4CAF50")
plt.scatter(d2[~good, 1], d2[~good, 0], s=8, label="Outliers", color="#F44336")
plt.title('Outlier Detection using Local Outlier Factor', fontsize=20)
plt.legend(fontsize=15, title_fontsize=15)

In our case, I have set contamination to ‘auto’ (see the plot above) and found that LOF does not perform particularly well here, as the spread (density) of the data does not vary much. I also tried contamination values of 0.005, 0.01, 0.02, 0.05, and 0.09, but the performance was still not satisfactory.

Density-Based Spatial Clustering for Application with Noise (DBSCAN)

When our dataset is large and has multiple numeric features (multivariate), it becomes difficult to handle outliers using IQR, the Z-score, or LOF. Here the SK-Learn library’s DBSCAN comes to the rescue, allowing us to handle outliers in multivariate datasets.

DBSCAN considers two main parameters (listed below) to form a cluster with the nearest data points, and based on high- or low-density regions, it detects inliers or outliers.

(1) Epsilon: the neighborhood radius around a data point, which we can estimate from the k-distance graph.

(2) min_samples: the minimum number of points required within the epsilon radius for a point to be treated as part of a dense region.

In our case, however, we have no more than 5 features, and we have selected just two important numeric features to apply our learnings and visualize the results, since neither our tools nor the human brain can easily visualize multi-dimensional data all at once. We now apply DBSCAN to our dataset.

Importing the libraries and fitting the model. To nullify the noise in the dataset, we have normalized the data using the Min-Max Scaler.

from sklearn.cluster import DBSCAN
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import NearestNeighbors

mms = MinMaxScaler()
df[['Scrape Sale Qty.', 'Rate in Rs./Kg.']] = mms.fit_transform(df[['Scrape Sale Qty.', 'Rate in Rs./Kg.']])
df.head()

neigh = NearestNeighbors(n_neighbors=2)
nbrs = neigh.fit(df[['Scrape Sale Qty.', 'Rate in Rs./Kg.']])
distances, indices = nbrs.kneighbors(df[['Scrape Sale Qty.', 'Rate in Rs./Kg.']])

# Plotting the K-distance graph
distances = np.sort(distances, axis=0)
distances = distances[:, 1]
plt.figure(figsize=(8,5))
plt.plot(distances)
plt.title('K-distance Graph', fontsize=20)
plt.xlabel('Data Points sorted by distance', fontsize=14)
plt.ylabel('Epsilon', fontsize=14)
plt.show()

The plot above shows that the optimal epsilon value is close to 0.08, and for min_samples (the number of points we want within the epsilon radius of each data point), we select 10.

model = DBSCAN(eps=0.08, min_samples=10).fit(df[['Scrape Sale Qty.', 'Rate in Rs./Kg.']])
colors = model.labels_

plt.figure(figsize=(10,7))
plt.scatter(df['Rate in Rs./Kg.'], df['Scrape Sale Qty.'], c=colors)
plt.title('Outliers Detection using DBSCAN', fontsize=20)

The DBSCAN technique has efficiently detected the significant outliers using density-based spatial clustering, as can be seen in the plot.

Conclusion

IQR is the simplest and most mathematically explainable technique. It is good for univariate and bivariate data, as it relies on the quartiles, a robust measure of dispersion, to detect extreme values, but it is limited for multivariate datasets with a large number of numeric features. In our case, we applied it by defining a function to detect and treat outliers, and 76 data points were dropped as outliers.

DBSCAN does not require the number of clusters to be defined in advance and can detect anomalies where the data is arbitrarily distributed and not linearly separable. It has its own limitations when working with data of varying density. In our case, it detected 16 data points as potential outliers.

Happy learning !!

For further details, Connect me;

[email protected]

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

Related

Google Products & Services Data Portfolio

In a tech world of constant evolution, Google Cloud remains one of the data market’s standout performers.

Google faces intense competition from the likes of Microsoft, AWS, Oracle, and SAP, and it continues to grow its data offerings. Let’s review some of the core data products in its large portfolio:

Google Cloud’s migration service allows you to migrate databases to Cloud SQL from on-premises environments, Google Compute Engine, and other clouds. The process offers migration with minimal downtime.

Key features include:

Ease of use:

A guided experience takes you through every step of the way.

Server-less experience:

No servers to provision, manage, or monitor, ensuring uninterrupted data replication at scale.

Open source and compatible:

Compatible with other systems, such as MySQL and PostgreSQL, which allows you to migrate data without surprises.

Google Cloud’s databases offer a central and secure location to manage your organization’s data and allows for real-time data analytics from any location. 

The database line, Google says, is an opportunity to tap into the database innovation that powers Google’s own underlying business.

They offer a range of features, including:

Bigtable:

Allows you to manage wide-column databases capable of handling billions of rows, storing data such as financial, marketing, and IoT information.

Firebase real-time database:

Provides real-time data services ideal for handling simple datasets.

Firestore:

A more flexible and scalable NoSQL database option, which is useful for storing more complex data.

The multi-cloud analytics products are fully managed, giving companies the ability to access high-performance analytics capabilities at scale with real-time insights.

They provide fast and accessible ways for enterprises to upgrade to a solution which is truly fit for the digital transformation. 

This enables businesses to solve problems such as dark data, in which data within an organization may be difficult to discover, silos, and latency. 

With the Google Cloud behind them, users can scale their data and analytics capacities to power business decisions.

See more: AWS Data Portfolio Review

Google Cloud’s data portfolio has been used by some of the biggest companies in the world.

Working with Freedom Financial, Google Cloud helped the client migrate from the monolithic, slow database they had been using to a modern solution, offering the speed and convenience of the cloud. This would enable Freedom Financial to expand its suite of consumer products.

Before making the move, their data was hosted with another provider on a complex on-premises architecture divided into three units, each of which used about 600 GB of space split across three clusters. The structure would have made upgrading difficult and labor-intensive, with each cluster affecting the majority of their engineering teams.

Using Google’s migration service, Freedom Financial achieved a migration with minutes’ worth of downtime rather than hours.

“From the time that we told DMS to dump and load to the Cloud SQL instance to completion, they were all done and fully synchronized within 12–13 hours,” Freedom’s principal engineer says. “We’d kick one off in the afternoon, and by the time we got back the next morning, it was done. We’d actually been setting aside a few days for this task, so this was a great improvement.” 

AirAsia used Google Cloud to become a “data-first” business, enabling them to capture and analyze vastly increased quantities of data in real time. To ensure they were using data to become as agile and innovative as possible, they used Google’s BigQuery solution to find and extract data.

This allowed them to collect and retrieve data to be visualized through Google Data Studio. This has given them a fast and secure base to deliver insights and scale new features.

Google’s Cloud products enjoy some of the most favorable user ratings in the market.

The Google Cloud SQL database, for instance, scores an average of four stars out of five at TrustRadius, with users praising its security, agility and multi-platform recovery ability. 

However, some users complain that many areas of the solution could be developed further, and the dashboard could have better customization capabilities.

Google’s data portfolio has been recognized with a number of industry awards, including:

Gartner named Google Cloud a leader in the 2024 Magic Quadrant for Cloud Database Management Systems.

Google was also named a cloud leader for its analytics by Forrester Research in the 2024 Streaming Analytics report.

See more: SAP Data Portfolio Review

How To Link Data Between Google Sheets

First, select the cell where you want the imported data to appear, then type = followed by the name of the sheet you want to link to and the cell you want to link. So in our case we’ll link the data in cell A1 from “Sheet2”:

='Sheet2'!A1

That data will now appear in your first sheet.

If you prefer to pull a whole column, you can type your equivalent of the following:

={'Sheet2'!A1:A9}

How to Link Data Using IMPORTRANGE

The “range string” is the name of the exact sheet you’re pulling data from (called “Sheet1,” “Sheet2,” etc. by default), followed by a ‘!’ and the range of cells you want to pull data from.

Here is the sheet we’ll be pulling data from:
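The formula itself takes two arguments: the source spreadsheet’s URL (or key) and the range string described above. A minimal example, using a placeholder URL, would be:

=IMPORTRANGE("https://docs.google.com/spreadsheets/d/SPREADSHEET_KEY", "Sheet1!A1:B10")

The first time you use IMPORTRANGE against a new source spreadsheet, Google Sheets will ask you to allow access before the data appears.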

Using QUERY to Import Data More Conditionally

IMPORTRANGE is fantastic for moving bulk data between sheets, but if you want to be more specific about what you want to import, then the Query function is probably what you’re looking for. This will search the source sheet for certain words or conditions you set, then pull corresponding data from the same row or column.

So for our example we’ll again pull data from the below sheet, but this time we’re going to grab only the “Units Sold” data from Germany.

To grab the data we want, we’ll need to type the following:

=QUERY( ImportRange( "1ByTut9xooZdPIBF55gzQ0Cdi04owDTtLVc_gPGtOKY0", "Sheet1!A1:O1000" ) , "select Col5 where Col2 = 'Germany'")

Here, the “ImportRange” data follows exactly the same syntax as before, but now we’re prefixing it with QUERY(, and afterwards we’re telling it to select column 5 (the “Units Sold” column) where the data in column 2 says “Germany”. So there are two “arguments” within the query – ImportRange and select ColX where ColY = 'Z'.

Robert Zak

Content Manager at Make Tech Easier. Enjoys Android, Windows, and tinkering with retro console emulation to breaking point.

