---
title: "Workflow Design"
author: "LaNubia Consulting, Data Science Team"
date: "12/13/2021"
output:
  html_document:
    code_download: true
    toc: true
    toc_float: true
    theme: flatly
    highlight: monochrome
---
```{css, echo=FALSE}
pre {
  max-height: 300px;
  overflow-y: auto;
}

pre[class] {
  max-height: 100px;
}
```

---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(readxl)
library(dplyr)
library(lubridate)
library(DT)

# Helper (currently unused): creates one empty data frame per sheet defined
# in the SAP template. Same logic as the createmptySAPfiles chunk below.
make_empty_sap_sheets <- function() {
  for (i in seq_along(snames)) {
    cnames <- unique(saptemplate[saptemplate$`Sheet Name` == snames[i], ]$Header)
    df <- setNames(data.frame(matrix(nrow = 0, ncol = length(cnames))), cnames)
    assign(snames[i], df, envir = .GlobalEnv)
  }
}
```
## Data transformation workflow

Following is the proposed preliminary workflow for the data transformation project.

### Code Lists

Let's store the code lists in a single directory and read them programmatically, sparing our fingers, mouse, and trackpad. First, let's create a list of the code list files.
```{r Create List of Files, echo=TRUE, message=FALSE, warning=FALSE}
# We could avoid a separate directory for code lists, but organizing them
# would then be harder. This can be revisited if we want to transform all the
# data in one go, i.e. not by segment (contacts, accounts etc.).
filenames <- list.files("./contacts/CodeList", pattern = "\\.xlsx$", full.names = TRUE)

# File paths
print(filenames)
```

*Please ensure that there are no hidden files in the directory.*
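A common culprit is Excel's own lock files (`~$...`), which match the `.xlsx` pattern while a workbook is open. A defensive sketch, using hypothetical paths for illustration:

```{r lockfilefilter, echo=TRUE, message=FALSE, warning=FALSE}
# Hypothetical paths for illustration; the real ones come from list.files()
paths <- c("./contacts/CodeList/Gender.xlsx",
           "./contacts/CodeList/~$Gender.xlsx") # Excel lock file

# Drop Excel lock files ("~$...") that would otherwise match the .xlsx pattern
paths <- paths[!grepl("^~\\$", basename(paths))]
paths
```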
Now, let's attempt reading them.
```{r codelistreader, echo=TRUE, message=FALSE, warning=FALSE}
sheet_names <- lapply(filenames, excel_sheets) # Creates a list of the sheet names per file

codelist_files <- list()
for (i in seq_along(filenames)) {
  # Reads every sheet of the Excel file as text
  sheets <- lapply(excel_sheets(filenames[[i]]), read_excel,
                   path = filenames[[i]], col_types = "text")
  names(sheets) <- sheet_names[[i]] # Renames them according to the sheet names extracted above
  codelist_files <- c(codelist_files, sheets) # Accumulates the sheets across files
}

# Names of the sheets imported
names(codelist_files)
```
### Templates

Now we shall extract the templates. There are two templates for each file: one for SAP, i.e. the file that needs to be uploaded to SAP, and the other is the legacy file that needs to be converted to the SAP template format.

We shall start with the legacy format. Since we do not have the real data, we have created a dummy, currently just one table, `Contact`. Some intentional errors have been introduced in the file.

Let us now extract the data. Below we read a single file holding all data related to `Contacts` from the legacy system.
```{r readlegacy, echo=TRUE, message=FALSE, warning=FALSE}
oldfilepath <- "./contacts/olddummy.xlsx"
old.data <- lapply(excel_sheets(oldfilepath), read_excel, path = oldfilepath)
names(old.data) <- excel_sheets(oldfilepath)

# Names of the tables imported
names(old.data)
```
We shall use `Contact_o`, i.e. the old contact table, and transform it into the required SAP upload format after checking for possible errors.

*The process to transform all the data of one segment (e.g. Contacts or Accounts) in a single pass can be implemented only if we can guarantee error-free data in the legacy system. Since that is not possible, we need to do it per table (per sheet of the Excel file).*

Now we shall create the SAP template.

Although the file has multiple sheets, only the last sheet, `Field_Definitions`, holds enough information for us to create the template.
```{r readSAPtemplate, echo=TRUE, message=FALSE, warning=FALSE}
saptemplate <- read_excel("./contacts/Contact.xlsx", sheet = "Field_Definitions")

# First few rows of the imported data
head(saptemplate)
```
*Please note that the format of the table (sheet) has been slightly changed: earlier, the corresponding sheet name was mentioned in a row before the actual table; now every row carries the corresponding sheet name. This was done manually for convenience of data extraction.*
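Should the original layout (sheet name on its own row above the table) return, the fill-down need not be done by hand; a minimal sketch using `tidyr::fill()`, assuming the sheet name lands in a `Sheet Name` column that is `NA` on the remaining rows:

```{r sheetnamefill, echo=TRUE, message=FALSE, warning=FALSE}
library(tidyr)

# Hypothetical input: the sheet name appears once, the rows below are NA
tmpl <- data.frame(`Sheet Name` = c("Contact", NA, NA),
                   Header = c(NA, "First Name", "Last Name"),
                   check.names = FALSE)

# Carry the last non-missing sheet name downwards
tmpl <- fill(tmpl, `Sheet Name`, .direction = "down")
tmpl
```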
Although we will be using only one table at the moment, all the templates are created below.
```{r createmptySAPfiles, echo=TRUE, message=FALSE, warning=FALSE}
snames <- unique(saptemplate$`Sheet Name`)

# Creates an empty data frame for each sheet in snames
for (i in seq_along(snames)) {
  cnames <- saptemplate[saptemplate$`Sheet Name` == snames[i], ]$Header # Defines the column names
  df <- setNames(data.frame(matrix(nrow = 0, ncol = length(cnames))), cnames) # Creates an empty data frame with those columns
  assign(snames[i], df) # Assigns df to a data frame named after snames[i]
}
```
## Transforming and errors

Steps to check the legacy data for errors and transform it into a SAP-compatible format.

### Check the column names and sequence of the file
```{r colnamecheck, echo=TRUE, message=FALSE, warning=FALSE}
# Column names of the Contact table
colnames(Contact)
```
### Create a copy of the old file to work on

```{r copylegacy, echo=TRUE, message=FALSE, warning=FALSE}
old.copy <- old.data$Contact_o # Selecting only one table as a sample
old.copy
```
### Rename the columns as per the column names of the template

To do this in a safe way there are two options:

- Create another column in the Field Definitions Excel sheet that contains the corresponding column names in the legacy file.
- Create a separate sheet where this mapping is defined.

Manual typing is error prone and hence should be avoided, at least in this code. For the time being, we have created a separate mapping file, although the first option is easier to maintain and is recommended.
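For illustration, the mapping file could look like the following. The column names `oldkey` and `Header` match the code in the next chunk, but the contents here are hypothetical:

```{r contactmapexample, echo=TRUE, message=FALSE, warning=FALSE}
# A hypothetical contact_map.csv (semicolon-separated); the real file differs
map_text <- "oldkey;Header
fname;First Name
lname;Last Name"
mapped.sample <- read.csv(text = map_text, sep = ";")
mapped.sample
```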
```{r columnrename, echo=TRUE, message=FALSE, warning=FALSE}
mapped <- read.csv("./contacts/contact_map.csv", sep = ";")

# Look up each legacy column name in the mapping and take the SAP header
colnames(old.copy) <- mapped$Header[match(colnames(old.copy), mapped$oldkey)] # Changing column names
```
### Check for errors

Mandatory columns:

```{r listmandatory, echo=TRUE, message=FALSE, warning=FALSE}
saptemplate[saptemplate$`Sheet Name` == "Contact", ] |>
  filter(Mandatory == "Yes") |>
  pull(Header) -> essential.rows # List of mandatory columns
```
Check if some rows have missing items in mandatory columns:

```{r checkessentialrows, echo=TRUE, message=FALSE, warning=FALSE}
essen.rows.table <- data.frame(Item = character(0), Missing = integer(0))

# Counts the missing values per mandatory column; creates the table below
for (i in seq_along(essential.rows)) {
  essen.rows.table[i, 1] <- essential.rows[i]
  essen.rows.table[i, 2] <- sum(is.na(old.copy[, essential.rows[i]]))
}

essen.rows.table
```
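The same counts can also be computed without a loop; a dplyr sketch on toy data:

```{r nacountsketch, echo=TRUE, message=FALSE, warning=FALSE}
library(dplyr)

df <- data.frame(a = c(1, NA, 3), b = c(NA, NA, "x"))
na.counts <- df |> summarise(across(everything(), ~ sum(is.na(.x))))
na.counts
```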
Remove the rows with missing mandatory values:

```{r removeessenrows, echo=TRUE, message=FALSE, warning=FALSE}
# Removes the rows with missing mandatory values
for (i in seq_along(essential.rows)) {
  old.copy <- old.copy[!is.na(old.copy[, essential.rows[i]]), ]
}
```
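An equivalent, loop-free alternative is `tidyr::drop_na()`, which drops rows with `NA` in the selected columns; a sketch on toy data:

```{r dropnasketch, echo=TRUE, message=FALSE, warning=FALSE}
library(tidyr)

df <- data.frame(id = 1:3, name = c("a", NA, "c"))
cleaned <- drop_na(df, name) # keeps only rows where name is present
cleaned
```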
Check whether the values in code-listed columns come from the code list:

```{r codelistcolcheck, echo=TRUE, message=FALSE, warning=FALSE}
codelistcols <- saptemplate[saptemplate$`Sheet Name` == "Contact", ] |>
  filter(!is.na(`CodeList File Path`)) |>
  pull(Header) # List of columns that have a code list

codelisted.rows.table <- data.frame(Item = character(0), Missing = integer(0),
                                    Not_from_code = integer(0))
# Creates the table below
for (i in seq_along(codelistcols)) {
  codelisted.rows.table[i, 1] <- codelistcols[i]
  codelisted.rows.table[i, 2] <- sum(is.na(old.copy[[codelistcols[i]]]))
  # NA is allowed in the comparison, otherwise empty cells also get counted
  codelisted.rows.table[i, 3] <- sum(!old.copy[[codelistcols[i]]] %in%
                                       c(codelist_files[[codelistcols[i]]]$Description, NA))
}
codelisted.rows.table
```
If a value does not appear in the corresponding code list, we blank it out:

```{r removemismatchcode, echo=TRUE, message=FALSE, warning=FALSE}
# Sets the value to NA in case of a mismatch with the code list
for (i in seq_along(codelistcols)) {
  mismatch <- !old.copy[[codelistcols[i]]] %in%
    c(codelist_files[[codelistcols[i]]]$Description, NA)
  old.copy[mismatch, codelistcols[i]] <- NA
}
```
### Translate into SAP language using the code lists

```{r joincodelistcodes, echo=TRUE, message=FALSE, warning=FALSE}
# Matches each column against the corresponding code list (by Description)
# and replaces every value with the code from the list's second column
for (i in seq_along(codelistcols)) {
  cl <- codelist_files[[codelistcols[i]]]
  old.copy[, codelistcols[i]] <- pull(cl, 2)[match(old.copy[[codelistcols[i]]], cl$Description)]
}
```
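The `match()` call above is just a vectorised lookup: values not found in the code list come back as `NA`. On toy data with hypothetical codes:

```{r matchlookupsketch, echo=TRUE, message=FALSE, warning=FALSE}
descriptions <- c("Male", "Female")           # code list descriptions (hypothetical)
codes <- c("M", "F")                          # corresponding codes
values <- c("Female", "Male", "Unknown")      # legacy values

codes[match(values, descriptions)] # "F" "M" NA for the unmatched value
```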
### Fix column types

```{r datatypecheck, echo=TRUE, message=FALSE, warning=FALSE}
dtype <- saptemplate[saptemplate$`Sheet Name` == "Contact", ]$`Data Type` # List of data types; non-exhaustive at the moment

# This list will grow and change based on the input date and time formats
for (i in 1:ncol(old.copy)) {
  if (dtype[i] == "String") {
    old.copy[, i] <- as.character(pull(old.copy, i))
  }
  if (dtype[i] == "Boolean") {
    old.copy[, i] <- as.logical(pull(old.copy, i))
  }
  if (dtype[i] == "DateTime") {
    old.copy[, i] <- lubridate::ymd_hms(pull(old.copy, i))
  }
  if (dtype[i] == "Time") {
    old.copy[, i] <- lubridate::hms(pull(old.copy, i))
  }
}
```
### String length

```{r checktrimstringlength, echo=TRUE, message=FALSE, warning=FALSE}
max.length <- saptemplate[saptemplate$`Sheet Name` == "Contact", ]$`Max Length` # List of maximum lengths per column

colclasses <- lapply(old.copy, class) # Getting the column classes

# If a string is longer than the allowed maximum, trim it to that length
for (i in 1:ncol(old.copy)) {
  if (colclasses[[i]] == "character") {
    old.copy[, i] <- ifelse(nchar(pull(old.copy, i)) > max.length[i],
                            substring(pull(old.copy, i), 1, max.length[i]),
                            pull(old.copy, i))
  }
}
```
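The trim itself is plain `substring()`; on toy data:

```{r trimsketch, echo=TRUE, message=FALSE, warning=FALSE}
s <- c("abcdef", "ab")
trimmed <- ifelse(nchar(s) > 4, substring(s, 1, 4), s) # cap at 4 characters
trimmed
```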
### Save file

```{r savefile, echo=TRUE, message=FALSE, warning=FALSE}
write.csv(old.copy, "Contact.csv", row.names = FALSE) # Saving the CSV file
```
View the exported file:

```{r viewexport, echo=TRUE, message=FALSE, warning=FALSE}
contacts.sap <- read.csv("Contact.csv")
datatable(contacts.sap, options = list(scrollX = TRUE))
```

**Viewing is for a sample only. Larger files cannot be rendered in HTML and require server-side processing.**
## Questions and Comments

- What shall we do with values that do not match items in the code list? Currently we are removing those entries, but there may be cases where removing may not make sense, e.g. `M` instead of `Male`.
- What shall we do with missing items in normal (non-coded) columns?
- What shall we do with missing values in mandatory columns? Currently we delete the complete row.
- What if the state mentioned does not belong to the country mentioned? Shall we check that too?
- What is the input format of logical variables (Yes/No, T/F, TRUE/FALSE etc.)?
- What is the output format (SAP) for logical variables?
- What is the input format of date, datetime and time variables?
- What is the output format of date, datetime and time variables?
- What to do if a string is longer than what is allowed?
- Where code lists are mentioned, what is the input variable? It is assumed that the input files will have descriptions and not the codes.
- Where code lists are mentioned, what should be the output (code or description)? Currently we take the description as input and convert it to code.
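Whatever the answers turn out to be, the logical and date normalisations are cheap to prototype. A sketch assuming hypothetical input conventions (Yes/No, T/F and mixed date orders); the real formats must come from the data owners:

```{r normalisationsketch, echo=TRUE, message=FALSE, warning=FALSE}
library(lubridate)

# Hypothetical truthy/falsy spellings; extend once the real convention is known
to_logical <- function(x) {
  x <- toupper(trimws(x))
  ifelse(x %in% c("YES", "Y", "T", "TRUE", "1"), TRUE,
         ifelse(x %in% c("NO", "N", "F", "FALSE", "0"), FALSE, NA))
}

to_logical(c("Yes", "no", "TRUE", "maybe")) # TRUE FALSE TRUE NA

# parse_date_time() tries each candidate order in turn
parse_date_time(c("2021-12-13 10:00:00", "13/12/2021"),
                orders = c("ymd HMS", "dmy"))
```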
This document contains several explanations, and the workflow has been divided into steps. These steps may increase or decrease (by combining several steps), depending on the data quality and structure. For example, if the input datetime format differs between tables, each table may require manual transformation, increasing the number of steps and the complexity. If the data quality is good, it may even be possible to transform all the tables in a segment (contacts, accounts etc.) in a single run. This will become clear only after obtaining the input data. The code can further be adjusted for faster processing.