The Importance of the Basics: Data Management

Every day we interact with data, whether through our everyday decision-making or at a business operational level. However, much of that data arrives unstructured, and our goal as business and data analysts is to turn it into information and knowledge. Sometimes analysts forget this goal and focus too much on building complex databases and data warehouses without paying attention to the quality and applicability of the data. Data management is the foundation for everything else that has to do with analyzing data; if our first steps are thoughtless, the whole structure can crumble. In data management, we gather, process, and examine our data based on the end goal we have established.


1. Information extraction – Our first step is information extraction, where we use different sources such as websites, documents, and other repositories to retrieve the specific information we're trying to analyze. The data we extract needs to help answer the goal set at the beginning. There are different ways to extract information, a popular one being web scraping.

Data scraping is a process carried out by computer programs that extract readable output from other sources, such as websites in the case of web scraping. There are many web scraping tools to choose from. I've tested the webscraper.io developer tool and it has proven useful when scraping an extensive website, for example Zillow or Indeed.com.
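
To make this concrete, here is a minimal Python sketch of the same idea using the requests and BeautifulSoup libraries. The URL and the CSS selector are placeholders for illustration, not a real site or the webscraper.io tool itself.

    # Minimal web-scraping sketch: download a page and pull out matching elements.
    # The URL and the .listing-title selector are hypothetical placeholders.
    import requests
    from bs4 import BeautifulSoup

    url = "https://example.com/listings"          # placeholder page to scrape
    response = requests.get(url, timeout=10)
    response.raise_for_status()                   # stop early on HTTP errors

    soup = BeautifulSoup(response.text, "html.parser")

    # Collect the text of every element matching the (hypothetical) listing-title class.
    titles = [tag.get_text(strip=True) for tag in soup.select(".listing-title")]

    for title in titles:
        print(title)

Real sites usually need extra care: pagination, rate limits, and respecting the site's terms of service and robots.txt.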


2. Data processing – After we extract and collect our data, we have to adjust its format and organization to make it easier to read and to enable the downstream processes that will give us the knowledge we're after. Data processing looks different depending on the data and our goal. A good place to start is looking at the structure of the data, including the data types, structural classification, and general observations such as the number of data points (observations) and the number of factors (columns). Part of data processing is performing data cleaning tasks, which can be divided into three parts: parsing, correction, and standardization (illustrated in the short sketch after this list).

  • Parsing: identifying and separating data elements from unstructured text. This involves looking for patterns and specifications that belong to the factors in our data. Regular expressions become useful in this process by helping us match these patterns against our set factors.
  • Correction: addressing values that may cause errors further down the pipeline. This includes filling missing values, checking for conflicting values, and fixing the datatypes of our columns.
  • Standardization: creating preferred formats and applying them to our data. Some common examples are date formats, addresses (including state notation), and commonly used institution names (including abbreviations).
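
Here is a minimal pandas sketch of all three tasks on a small made-up table; the column names and values are invented purely for illustration.

    # Illustration of parsing, correction, and standardization on invented data.
    import pandas as pd

    df = pd.DataFrame({
        "raw_contact": ["Jane Doe <jane@example.com>", "John Smith <john@example.com>"],
        "age": ["34", None],                        # stored as text, one value missing
        "visit_date": ["03/15/2022", "04/02/2022"],
        "state": ["Massachusetts", "MA"],
    })

    # Parsing: a regular expression pulls the email address out of the free text.
    df["email"] = df["raw_contact"].str.extract(r"<([^>]+)>", expand=False)

    # Correction: fix the column datatype, then fill the missing value.
    df["age"] = pd.to_numeric(df["age"])
    df["age"] = df["age"].fillna(df["age"].median())

    # Standardization: one date format and one state notation.
    df["visit_date"] = pd.to_datetime(df["visit_date"], format="%m/%d/%Y").dt.strftime("%Y-%m-%d")
    df["state"] = df["state"].replace({"Massachusetts": "MA"})

    print(df)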

3. Data profiling – Our last step in data management is getting knowledge from our now collected and cleaned data. Data profiling takes the data and prepares summaries about it. Among the summary outputs from data profiling are frequency distributions for each variable, identification and categorization of missing values, data type checks, and data integrity assessments. With data profiling we can assess our data quality and decide whether any changes need to be made in the earlier steps of the process.
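
As a rough sketch of what these summaries look like in practice, the pandas snippet below produces data types, missing-value counts, descriptive statistics, and a frequency distribution for a small invented dataset. Dedicated packages such as ydata-profiling in Python or skimr in R bundle this kind of output into a single report.

    # Basic profiling summaries with pandas on a small invented dataset.
    import pandas as pd

    df = pd.DataFrame({
        "state": ["MA", "NY", "MA", None],
        "age": [34, 29, 41, 35],
    })

    print(df.dtypes)                                # check data types
    print(df.isna().sum())                          # count missing values per column
    print(df.describe(include="all"))               # summary statistics for every column
    print(df["state"].value_counts(dropna=False))   # frequency distribution of one variable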


Choosing a tool for processing and profiling data

There’s a growing number of applications and programming languages that can assist with the processing and profiling of our data. Personally, there are two languages I’ve studied and tried: R and Python. Each language has unique functions and libraries that allow us to process the data and create data profiling outputs. It’s important to know the differences in their syntax, since they can lead to very different results when processing data. For example, for indexing, R starts from 1 whereas Python starts from 0; a tiny illustration of that difference follows. As for data profiling, the screenshots further below show the first section of the Python and R profiling outputs, respectively, run on the same dataset.
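
A quick sketch of the indexing difference, with the equivalent R behavior noted in a comment:

    # Python indexes from 0, so position 0 is the first element.
    values = [10, 20, 30]
    print(values[0])   # prints 10, the first element
    # In R, the same first element would be values[1], because R indexes from 1.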

[Screenshots: first section of the Python and R data profiling outputs, generated on the same dataset]

Garbage in, garbage out 

The steps in data management may seem redundant and like common sense to most analysts. However, the quality of our data determines everything else we do with it. If our data is not properly curated, then nothing we do with it will be accurate. As analysts, we tell the computers what to do with the data because we are the ones who understand the knowledge we’re after. We can’t expect our programs to run correctly if we give them data that’s unorganized and nonsensical. In other words, if we start with garbage data, we can expect to end with garbage outputs.




