28 Data Analysis Projects to Boost Your Skills [2023 Guide]
Data analytics projects showcase the analytics process, from finding data sources to cleaning and processing data. If you’re searching for your first data analysis job, projects allow you to gain experience using different data analytics tools and techniques. The best projects answer unexpected questions and explore relationships that aren’t immediately intuitive. In this post, we’ll tell you how to create data analytics projects that make you immediately hirable.
What’s the Point of a Data Analysis Project?
Doing data analysis projects is critical to landing a job, as they show hiring managers that you have the skills for the role. Professionals in this field must master a myriad of skills, from data cleaning and data visualization to programming languages like SQL, R, and Python. A data analysis project can demonstrate your aptitude with all of these skills. Furthermore, personal projects are a great way to practice a variety of data analysis techniques, especially if you lack real-world experience.
Data Analysis Projects for Beginners
Projects are an excellent way to gain experience with the end-to-end data analysis process, especially if you’re new to the field of data analysis. Here are some great project ideas for beginners:
Web Scraping
Web scraping is the extraction of data—such as images, user reviews, or product descriptions—from web pages. This information is first collected, then formatted. Web scraping can be done by writing custom scripts in Python, or by using an API or web scraping tool such as ParseHub. Here are two popular ways to practice web scraping:
Reddit is a popular repository for web scraping because of the sheer amount of data available— from qualitative data in posts and comments to user metadata and engagement with each post.
Subreddits let you extract posts on specific topics. PRAW is a Python package you can use to access Reddit’s API and pull data from the subreddits you’re interested in (a Reddit account is required to get API credentials). You can extract data from one or more subreddits at a time. If you’d rather not scrape your own data, you can find Reddit datasets on data.world.
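Here’s a minimal sketch of what that PRAW workflow might look like. The helper names (`submissions_to_rows`, `fetch_hot_posts`) are made up for this example, and the credentials are placeholders you would replace with your own. The demo at the bottom runs offline on stand-in objects, so nothing here touches the real API until you call `fetch_hot_posts` yourself.

```python
from types import SimpleNamespace


def submissions_to_rows(submissions):
    """Flatten submission-like objects into plain dicts (one per post)."""
    return [
        {
            "title": s.title,
            "score": s.score,
            "num_comments": s.num_comments,
        }
        for s in submissions
    ]


def fetch_hot_posts(subreddit_name, client_id, client_secret, user_agent, limit=25):
    """Fetch hot posts from one subreddit.

    Requires `pip install praw` and API credentials from your own Reddit account.
    """
    import praw  # imported here so the helper above works without PRAW installed

    reddit = praw.Reddit(
        client_id=client_id,
        client_secret=client_secret,
        user_agent=user_agent,
    )
    return submissions_to_rows(reddit.subreddit(subreddit_name).hot(limit=limit))


# Offline demonstration with stand-in objects (made-up values, not real posts).
fake_posts = [
    SimpleNamespace(title="Example post A", score=120, num_comments=14),
    SimpleNamespace(title="Example post B", score=45, num_comments=3),
]
rows = submissions_to_rows(fake_posts)
print(rows[0]["title"], rows[0]["score"])
```

Once the posts are plain dictionaries, you can load them straight into a Pandas DataFrame for cleaning and analysis.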
Real Estate
If you’re interested in real estate, you can use Python to scrape data on real-estate properties, then create a dashboard to analyze the “best” properties based on data points like property taxes, population, schools, and public transportation. There are two main Python libraries for data scraping: Scrapy and BeautifulSoup. You can also use the Zillow API to obtain real estate and mortgage data.
Exploratory Data Analysis
Another great project for beginners is to do an exploratory data analysis (EDA), which is the probing of a dataset to summarize its main characteristics. EDA helps determine which statistical techniques are appropriate for a given dataset. Here are some projects where you can work on your EDA chops:
McDonald’s Nutrition Facts
McDonald’s food items are often controversial because of their high fat and sodium content. Using this dataset from Kaggle, you can perform a nutrition analysis of every menu item, including salads, beverages, and desserts. First, import the CSV file in Python. Next, categorize items according to factors like sugar and fiber content. Finally, chart the results using bar and pie charts, scatter plots, and heatmaps. For this project, you’ll need the NumPy, Pandas, and Seaborn libraries.
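The categorizing step could look something like the sketch below. The rows, the column names, the `menu.csv` filename, and the sugar thresholds are all assumptions made for illustration, so check them against the actual Kaggle file before reusing any of it.

```python
import pandas as pd

# Made-up rows for illustration; the real Kaggle CSV has many more columns.
menu = pd.DataFrame(
    {
        "Item": ["Burger", "Side Salad", "Milkshake", "Apple Slices"],
        "Category": ["Beef & Pork", "Salads", "Desserts", "Snacks & Sides"],
        "Sugars": [9, 2, 63, 3],
        "Dietary Fiber": [3, 1, 0, 0],
    }
)
# With the real file you would start from something like:
# menu = pd.read_csv("menu.csv")  # hypothetical filename

# Bucket items by sugar content (thresholds are arbitrary assumptions).
menu["SugarLevel"] = pd.cut(
    menu["Sugars"],
    bins=[-1, 5, 20, float("inf")],
    labels=["low", "medium", "high"],
)

# Count how many items land in each bucket, ready for a bar chart.
sugar_counts = menu["SugarLevel"].value_counts().sort_index()
print(sugar_counts)
```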
World Happiness Report
The World Happiness Report surveys happiness levels around the globe. This project, from a student at Pennsylvania State University, uses SQLite, a popular database engine, to analyze the difference in happiness levels between the Northern and Southern Hemispheres.
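To get a feel for this kind of comparison, here’s a small sketch using Python’s built-in `sqlite3` module. The table layout, country names, and scores are invented for the example and are not taken from the student’s project or the actual report.

```python
import sqlite3

# In-memory database so the example leaves no file behind.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE happiness (country TEXT, hemisphere TEXT, score REAL)")

# Placeholder rows; real scores would come from the World Happiness Report data.
conn.executemany(
    "INSERT INTO happiness VALUES (?, ?, ?)",
    [
        ("Country A", "North", 7.0),
        ("Country B", "North", 6.0),
        ("Country C", "South", 5.0),
        ("Country D", "South", 6.0),
    ],
)

# Average score per hemisphere.
rows = conn.execute(
    "SELECT hemisphere, AVG(score) FROM happiness "
    "GROUP BY hemisphere ORDER BY hemisphere"
).fetchall()
averages = dict(rows)
print(averages)
conn.close()
```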
Global Suicide Rates
While there are countless datasets concerning suicide rates, this dataset created by Siddarth Sudhakar contains data from the United Nations Development Program, the World Bank, Kaggle, and the World Health Organization. Import the data into Python and use the Pandas library to explore the data. From there, you can summarize the data features. For example, you can uncover the relationship between suicide rates and GDP per capita.
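As a rough sketch of the “summarize, then look for relationships” step, the snippet below uses Pandas on a handful of invented numbers. The column names are assumptions and the values are not real statistics; they only show the mechanics.

```python
import pandas as pd

# Invented numbers purely to show the mechanics; not real statistics.
df = pd.DataFrame(
    {
        "country": ["A", "B", "C", "D", "E"],
        "gdp_per_capita": [5_000, 12_000, 25_000, 40_000, 60_000],
        "suicides_per_100k": [8.1, 10.4, 9.7, 12.3, 11.0],
    }
)

# Summary statistics (count, mean, quartiles, ...) for the numeric columns.
summary = df[["gdp_per_capita", "suicides_per_100k"]].describe()

# Pearson correlation between the two variables.
corr = df["gdp_per_capita"].corr(df["suicides_per_100k"])

print(summary)
print(f"Pearson correlation: {corr:.2f}")
```

Keep in mind that a correlation on its own says nothing about cause and effect, so treat it as a starting point for further questions.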
Data Visualization
Visualizations communicate trends, outliers, and patterns in your data. So if you’re new to the field, and looking for a data analysis project, then creating visualizations is a great place to start. Select graphs that are ideal for the story you’re trying to tell. Bar charts and line charts succinctly illustrate changes over time, while pie charts model part-to-whole comparisons. Meanwhile, bar charts and histograms show the distribution of data. Here are some great data visualization projects for beginners:
Pollution in the United States
The Environmental Protection Agency releases annual data on air quality trends. This dataset from Kaggle features EPA pollution data from 2000–2016 in one CSV file. You can visualize this data using the Python Seaborn library or the OpenAir package in R. For example, you can model changes in emissions concentrations according to time, day of the week, or month. You can also use a heatmap to find the most polluted times of the year in a given area.
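Before you can draw a heatmap, you need the data in a grid. Here’s one way that reshaping might look in Pandas, using synthetic readings and an assumed column name (`no2_mean`) rather than the real EPA file.

```python
import pandas as pd

# Synthetic readings for illustration; column names are assumptions.
readings = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2016-01-04", "2016-01-05", "2016-01-11", "2016-02-01", "2016-02-02"]
        ),
        "no2_mean": [20.0, 30.0, 40.0, 10.0, 14.0],
    }
)
readings["month"] = readings["date"].dt.month
readings["weekday"] = readings["date"].dt.day_name()

# Rows = month, columns = weekday, cells = mean concentration.
grid = readings.pivot_table(
    index="month", columns="weekday", values="no2_mean", aggfunc="mean"
)
print(grid)
```

With Seaborn installed, passing `grid` to `seaborn.heatmap` would draw this table as a heatmap.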
History Visualization
Data visualizations are a great way to illustrate historical events, such as the spread of the printing press or trends in coffee production and consumption. This visualization by Harvard Business School depicts the largest US companies in the year 1955. A second analysis in 2015 shows how much has changed. There is also an abundance of datasets available on World War II. This Kaggle dataset features data on weather conditions during the war, which had a major influence on the success of an invasion.
Astronomical Visualization
Modern telescopes and satellites produce digital images that are perfect for data visualization. This dataset from data.world shows future asteroids poised to pass near Earth within the next 12 months, as well as those that have made a close approach within the last 12 months. You can view live visualizations based on the dataset here to inspire your own analysis. You can also use this resource to find the asteroid orbital classes for each data point (e.g., asteroid, Apollo, Centaur).
Instagram Visualization
This project on KDNuggets makes use of Jupyter notebooks and IPython to analyze Instagram data. Regular Python works fine, but you may not be able to display the images in your notebook. You can use Instagram data to compare the popularity of two political candidates, like this project, or perform a time series analysis on a public figure’s popularity before and after a major event.
Sentiment Analysis
Sentiment analysis (AKA “opinion mining”) entails using natural language processing (NLP) to determine how people feel about a product, public figure, or political party, for example. Each input is assigned a sentiment score, which classifies it as positive, negative, or neutral. You’ll definitely want to hone this skill to land a job in data analysis. Here are some great projects to add to your portfolio:
Twitter Sentiment Analysis
Social media posts can be classified according to polarity or emotion-specific keywords. The Apache NiFi GetTwitter processor obtains real-time tweets and ingests them into a messaging queue so you can obtain posts about a trending topic or hashtag. Alternatively, use Twitter’s Recent Search Endpoint. Once you’ve generated your dataset, you can determine sentiment scores using Microsoft Azure’s Text Analytics Cognitive Service, which identifies key phrases and entities such as people, places, and organizations.
Audience Reviews on Google
Google reviews are a great resource for customer feedback, and also make for a great data analysis project. The Google My Business API lets you extract reviews and work with location data. In this project on Medium, data enthusiast Nikita Bhole used Python to perform a sentiment analysis on user reviews from the Google Playstore. She then used Pandas profiling to perform an exploratory data analysis to find variables, interactions, correlations, and missing values. Next, she used TextBlob to calculate a sentiment score based on sentiment polarity and subjectivity.
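To make the idea of a polarity score concrete, here’s a toy lexicon-based scorer in plain Python. It is not TextBlob’s algorithm, and the word lists are tiny and hand-picked for this example, but it shows how a piece of text can be turned into a number and then into a positive, negative, or neutral label.

```python
import re

# Tiny hand-made lexicon; real tools use far larger vocabularies.
POSITIVE = {"great", "love", "excellent", "good", "helpful"}
NEGATIVE = {"bad", "terrible", "hate", "slow", "broken"}


def polarity(text):
    """Return a score in [-1, 1]: (positive hits - negative hits) / total hits."""
    words = re.findall(r"[a-z']+", text.lower())
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total


def label(score):
    """Map a polarity score to a coarse sentiment label."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


reviews = [
    "Great app, I love it",
    "Terrible update, everything is broken",
    "It opens and closes",
]
for review in reviews:
    print(label(polarity(review)), "-", review)
```

A word-counting approach like this misses negation and sarcasm, which is one reason to reach for a dedicated library on a real project.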
Quora Question Pairing
Quora is one of the most popular question-and-answer websites in the world, making it ripe for data analysis. In a recent Kaggle challenge, users were tasked with using advanced NLP to classify duplicate question pairs. For example, the queries “What is the most populous state in the USA?” and “Which state in the United States has the most people?” should not exist separately on Quora. This dataset from Quora contains over 400,000 lines of potential duplicate question pairs. Each line contains IDs for each question in the pair, the full text of each question, and a binary value that indicates whether the line contains a duplicate pair. In this project conducted by a group of NYU students, a basic linear model based on n-grams (short sequences of consecutive words) was used to build a set of features for a natural language understanding (NLU) model. They then used scikit-learn’s Support Vector Machine (SVM) implementation for their experiments with word embeddings.
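If n-grams are new to you, the sketch below shows one simple feature you could compute from them: how much two questions overlap. This is an illustration of the general idea only, not the NYU students’ feature set or model, and the function names are made up for the example.

```python
import re


def ngrams(text, n=2):
    """Return the set of word n-grams in a lowercased question."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i : i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Share of n-grams two questions have in common (0 = none, 1 = identical)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


q1 = "What is the most populous state in the USA?"
q2 = "Which state in the United States has the most people?"

unigram_overlap = jaccard(ngrams(q1, 1), ngrams(q2, 1))
bigram_overlap = jaccard(ngrams(q1, 2), ngrams(q2, 2))
print(round(unigram_overlap, 2), round(bigram_overlap, 2))
```

Notice that these two questions mean the same thing but share relatively few word sequences, which is exactly why the challenge calls for more advanced NLP than simple overlap.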
Data Cleaning
Data cleaning is the process of fixing or removing incorrect, corrupted, duplicate, or incomplete data within a dataset. Messy data leads to unreliable outcomes. Cleaning data is an essential part of data analysis, and demonstrating your data cleaning skills is key to landing a job. Here are some projects to test out your data cleaning skills:
Airbnb Open Data (New York)
Airbnb’s open API lets you extract data on Airbnb stays from the company’s website. Alternatively, you can use this existing Kaggle dataset for Airbnb stays in New York City in 2019. Both data files include all the information needed to find out more about hosts and geographical availability, both of which are necessary metrics to make predictions and draw conclusions.
YouTube Videos Statistics
The top trending videos on YouTube provide a window into the current cultural zeitgeist. This dataset from Kaggle contains several months of data on daily trending YouTube videos from different countries. This includes the video title, channel title, publish time, tags, views, likes and dislikes, description, and comment count. Once cleaned, you could use this data for:
- Sentiment analysis
- Categorizing YouTube videos based on their comments and statistics
- Analyzing what factors affect how popular a YouTube video will be
- Statistical analysis over time
Educational Statistics
This project, from the book Data Science in Education Using R, analyzes this dataset compilation from the US Department of Education Website to uncover federal data on students with disabilities. You can prepare the data for analysis by cleaning the variable names. Then, you can explore the dataset by visualizing student demographics.
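The book works in R, but if you’re following along in Python, cleaning variable names might look something like this. The `clean_name` helper and the sample headers are invented for illustration and are not the dataset’s actual column names.

```python
import re


def clean_name(name):
    """Convert a raw column header to snake_case."""
    name = name.strip().lower()
    name = re.sub(r"[^a-z0-9]+", "_", name)  # runs of other characters -> "_"
    return name.strip("_")


raw_columns = ["State Name", " Total Students (Ages 3-5) ", "Year"]
cleaned = [clean_name(c) for c in raw_columns]
print(cleaned)  # ['state_name', 'total_students_ages_3_5', 'year']
```

With a Pandas DataFrame `df`, you could apply the same helper with `df.rename(columns=clean_name)`.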
Intermediate Data Analysis Projects
If you’re at the intermediate level and want to advance your data analysis career, you’ll want to improve your skills in data mining, data science, data collection, data cleaning, and data visualization. Here are some great projects to add to your portfolio:
Data Mining and Data Science
Data mining is the process of turning raw data into useful information. Here are some data mining projects that you can do to advance your career as a data analyst:
Speech Recognition
Speech recognition programs identify spoken words and convert them into text. To do this in Python, install a speech recognition package such as Apiai, SpeechRecognition, or Watson-developer-cloud. This project, which is called DeepSpeech, is an open-source speech-to-text engine using Google’s TensorFlow.
Anime Recommendation System
While streaming recommendation engines are useful, why not build a recommendation engine for a niche genre? This crowd-sourced dataset from Kaggle contains information on user preference data from 73,516 users on 12,294 anime shows. You can categorize similar shows based on reviews, characters, and synopses to build different recommendation algorithms.
Chatbots
A chatbot uses natural language processing to understand text inputs (chat messages) and generate responses. You can build a chatbot using the Natural Language Toolkit (NLTK) library in Python. Chatterbot is an open-source machine learning dialog engine on GitHub that lets anyone contribute dialog. Each time a user enters a statement, the library saves the text they entered. As Chatterbot receives more input, it learns to provide more varied responses with increasing accuracy.
Data Collection, Cleaning, and Visualization
Data collection is the process of gathering, measuring, and analyzing data from a variety of sources to answer questions, solve business problems, and investigate hypotheses. An effective data analysis project shows proficiency in all stages of the data analysis process, from identifying data sources to visualizing data. Here’s a project to advance your data collection, cleaning, and visualization skills:
Apple Watch Workout Analysis
The Apple Watch collects different types of workout data, including total calories burned, distance (for walking and running), average heart rate, and average pace. Using processed data, you can create visualizations such as rolling mean step count or step counts by days of the week, as seen in this project by full-stack engineer Mark Koester.
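Both of those summaries are short one-liners in Pandas. The sketch below uses two weeks of made-up daily step counts rather than a real Apple Watch export, and it is not taken from Mark Koester’s project.

```python
import pandas as pd

# Made-up daily step counts for two weeks, starting on a Monday.
steps = pd.Series(
    [4000, 6000, 8000, 10000, 12000, 3000, 5000,
     7000, 9000, 11000, 6000, 8000, 4000, 2000],
    index=pd.date_range("2023-01-02", periods=14, freq="D"),
    name="steps",
)

# 7-day rolling mean; the first six days have no full window, so they are NaN.
rolling_7d = steps.rolling(window=7).mean()

# Average steps for each day of the week.
by_weekday = steps.groupby(steps.index.day_name()).mean()

print(rolling_7d.dropna().head())
print(by_weekday)
```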
Advanced Data Analysis Projects
Ready for a more senior-level data analysis position? Here are some projects you can add to your portfolio:
Machine Learning
Machine learning enables computers to continuously make predictions based on the available data without being explicitly programmed to do so. These algorithms use historical data as input to predict new output values. Here are some common machine learning projects you can try out:
Fraud Detection
Machine learning uses models for fraud detection that continuously learn to detect new threats. This project for credit card fraud detection uses Amazon SageMaker to train supervised and unsupervised machine learning models, which are then deployed using Amazon SageMaker-managed endpoints.
Movie Recommendation System
Recommendation engines use data from user preferences and browsing history. To build a movie recommender, you can use this dataset from MovieLens, which contains 105,339 ratings applied to over 10,000 movies. Follow each step in more detail here.
Wine Quality Prediction
Wine classifiers make recommendations based on the chemical qualities of wine, such as density or acidity. This project on Kaggle uses the following three classifier models to predict the quality of wine:
- Random Forest Classifier
- Stochastic Gradient Descent Classifier
- Support Vector Classifier (SVC)
Pandas is also a useful library for this type of data analysis, while NumPy is good for working with arrays. Finally, you can use Seaborn and Matplotlib to visualize the data.
Netflix Personalization
To build a Netflix-inspired recommendation engine, create an algorithm that uses item-based collaborative filtering, which establishes similarities between products based on user ratings. This project establishes filtering capabilities across IMDb ratings, metatags, actors, genre, language, year of release, and so on. To generate your own dataset, you can download publicly available subsets of IMDb data.
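Here’s a bare-bones sketch of item-based collaborative filtering with NumPy. The ratings matrix and titles are invented, and treating unrated items as zeros is a simplification made for this example; it is not how the linked project or Netflix does it.

```python
import numpy as np

# Rows = users, columns = titles; 0 means "not rated". Values are invented.
titles = ["Title A", "Title B", "Title C", "Title D"]
ratings = np.array(
    [
        [5, 4, 0, 1],
        [4, 5, 1, 0],
        [1, 0, 5, 4],
        [0, 1, 4, 5],
    ],
    dtype=float,
)


def item_similarity(matrix):
    """Cosine similarity between every pair of item (column) vectors."""
    norms = np.linalg.norm(matrix, axis=0)
    norms[norms == 0] = 1.0  # avoid dividing by zero for unrated items
    normalized = matrix / norms
    return normalized.T @ normalized


def most_similar(title, k=1):
    """Return the k titles whose rating patterns are closest to `title`."""
    idx = titles.index(title)
    sims = item_similarity(ratings)[idx].copy()
    sims[idx] = -1.0  # exclude the item itself
    top = np.argsort(sims)[::-1][:k]
    return [titles[i] for i in top]


print(most_similar("Title A"))  # ['Title B']
```

Users who rated Title A highly also rated Title B highly, so the two end up as neighbors. A fuller system would use those similarities to predict ratings for titles a user hasn’t seen yet.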
Natural Language Processing
Natural language processing (NLP) is a branch of AI that helps computers interpret and manipulate natural language in the form of text and audio. Try adding some of these NLP projects to your portfolio to land a more senior-level position:
News Translation
You can build a web application that translates news from one language to another using Python. In this project, data scientist Abubakar Abid used Newspaper3k, a Python library that lets you scrape almost any news site. Then, he used Hugging Face Transformers, a library of state-of-the-art natural language models, to translate and summarize news articles from English to Arabic (you can choose another target language if desired). Finally, Abid used the Gradio library to build a web-based demo where he tried out the algorithm on different topics.
Autocomplete and Autocorrect
You can build a neural network in Python to autocomplete sentences and detect grammatical errors. This project on GitHub uses an LSTM model to autocomplete Python code to reduce the number of keystrokes required to write code. The model is trained after tokenizing Python code, which is more efficient than character-level prediction with byte-pair encoding.
Deep Learning
Deep learning is concerned with neural networks comprising three or more layers. These artificial neural networks are inspired by the structure and function of the human brain. Practice your deep learning skills with these projects:
Breast Cancer Classification
Breast cancer classification is a binary classification problem that works by categorizing biopsy photographs as benign or malignant. This project uses a convolutional neural network (CNN) to identify high-level features in the input images and implement matrix computations to infer a feature map.
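To see what “inferring a feature map” means in practice, here’s a sketch of a single convolution step written out in NumPy. The 4x4 “image” and the edge-detecting kernel are hand-made for the example; a real CNN learns its kernels from training data and stacks many such layers.

```python
import numpy as np


def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map.

    Like most deep learning libraries, this computes cross-correlation.
    """
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i : i + kh, j : j + kw] * kernel)
    return out


# 4x4 toy "image": dark left half, bright right half.
image = np.array(
    [
        [0, 0, 1, 1],
        [0, 0, 1, 1],
        [0, 0, 1, 1],
        [0, 0, 1, 1],
    ],
    dtype=float,
)
# Hand-picked vertical-edge kernel; a trained CNN learns its kernels from data.
kernel = np.array([[-1, 1], [-1, 1]], dtype=float)

feature_map = conv2d_valid(image, kernel)
print(feature_map)
```

The feature map is nonzero only in the middle column, right where the image switches from dark to bright, so this kernel acts as a simple vertical-edge detector.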
Image Classification
Image classification models can be trained to recognize specific objects or features. You can build one using a CNN in Keras with Python. This project uses the CIFAR-10 dataset, a popular computer vision dataset consisting of 60,000 images with 10 different classes. The dataset is already available in the datasets module of Keras, so you can directly import it from keras.datasets.
Gender and Age Detection
An advanced Python project, this model uses OpenCV and a CNN with three convolutional layers to guess the gender and age of a person in an image using the Adience dataset.
What Skills Should You Focus on With Your Data Analysis Project?
Regardless of their level or skillset, data analysts can always improve on the following skills:
SQL
SQL is mainly used for storing and retrieving data from databases, writing queries, and modifying the schema (structure) of a database system. In your data analysis project, be sure to make use of some of the most important SQL commands, such as SELECT, DELETE, CREATE DATABASE, INSERT INTO, ALTER DATABASE, CREATE TABLE, and CREATE INDEX.
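You can practice several of these commands without installing a database server by using Python’s built-in `sqlite3` module, as in the sketch below. The `orders` table and its rows are invented for the example. Note that SQLite creates the database implicitly when you connect, so CREATE DATABASE and ALTER DATABASE don’t apply here; you’d try those on a server-based system instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite creates the database implicitly
cur = conn.cursor()

# CREATE TABLE and CREATE INDEX define the schema.
cur.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)")
cur.execute("CREATE INDEX idx_orders_customer ON orders (customer)")

# INSERT INTO adds rows (made-up data).
cur.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("alice", 30.0), ("bob", 12.5), ("alice", 20.0), ("carol", 0.0)],
)

# DELETE removes rows that match a condition.
cur.execute("DELETE FROM orders WHERE amount = 0")

# SELECT retrieves and aggregates data.
totals = cur.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()
print(totals)  # [('alice', 50.0), ('bob', 12.5)]
conn.close()
```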
Programming
While data analysts don’t need to have advanced coding skills, the ability to program in R or Python lets you use more advanced data science techniques such as machine learning and natural language processing.
Data Cleaning Skills
Data cleaning is the process of preparing data for analysis by removing or modifying data that is incomplete, duplicated, incorrect, or improperly formatted. Fixing spelling and syntax errors, standardizing naming conventions, and correcting mistakes are key skills.
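Here’s a short sketch of how a few of those fixes might look in Pandas. The records are invented, and the specific steps (which column is required, how names are standardized) are assumptions you would adapt to your own dataset.

```python
import pandas as pd

# Invented messy records for illustration.
raw = pd.DataFrame(
    {
        "city": [" New York", "new york", "Boston", "Boston", None],
        "price": ["100", "100", "85", "85", "120"],
    }
)

clean = (
    raw.dropna(subset=["city"])  # drop rows missing a required field
    .assign(
        city=lambda d: d["city"].str.strip().str.title(),  # standardize naming
        price=lambda d: pd.to_numeric(d["price"]),  # store prices as numbers
    )
    .drop_duplicates()  # remove exact duplicates after standardizing
    .reset_index(drop=True)
)
print(clean)
```

The order matters here: duplicates like " New York" and "new york" only become identical, and therefore removable, after the names have been standardized.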
Visualization
As a data analyst, it’s important to communicate your findings with strong visuals that appeal to both technical and non-technical stakeholders. To visualize your data effectively, you need to know the specific use cases for each type of visual, from bar charts to histograms and more.
Microsoft Excel
Data analysts use Excel and other spreadsheet tools to sort, filter, and clean their data. Excel is also a useful tool for doing simple calculations (e.g., SUMIF and AVERAGEIF) or combining data using VLOOKUP.
Familiarity With Machine Learning, AI, and Natural Language Processing
Data analysts with machine learning skills are incredibly valuable, even though machine learning is not an expected skill for most data analyst jobs. While data analytics is primarily concerned with data modeling and applied statistics, machine learning algorithms go a step further in obtaining insights and predicting future trends.
How To Present and Promote Your Data Analytics Projects
A good data analytics portfolio showcases your abilities. Each project should articulate the value of the data product or model you’ve built. Describe the technical challenge and how you overcame it successfully, what tools you leveraged and why, and explain your findings using well-chosen visuals.
Your portfolio should feature a diverse collection of projects, including exploratory data analysis projects, a data cleaning project, a project that uses SQL, and data visualization projects. Promote your projects by uploading them to GitHub. If you use Tableau for data visualization, set your project to ‘Public’ so that it is searchable online by potential employers.
Data Analysis Project FAQs
Can You Include Your Projects on Your Resume?
If you lack real-world experience, projects are a great way to show off your skills. List each project the way you would a job. Briefly describe the scope of the project, the technical challenges you faced, and the outcome.
How Long Do Data Analysis Projects Take To Complete?
Projects can take anywhere from one or two weeks to several months to complete. It depends on the size and complexity of your dataset, processing time, how much data cleaning is required, and whether or not you decide to use machine learning and AI.
What Do You Learn From Data Analysis Projects?
Personal projects provide the opportunity to experience the end-to-end data analysis process, from EDA to data visualization. Projects also give you a chance to generate your own datasets, frame problem statements, and choose the right visuals to illustrate your findings.