---
title: "Projects"
author: "Scary Scarecrow"
date: "1/12/2022"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(readxl)
library(dplyr)
library(lubridate)
library(DT)
library(tidyr)
library(stringr) # str_sub() and str_length() are used in the transformation chunk

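# Creates one empty data frame per template sheet, using the headers listed in `saptemplate`
mutlstxlrdr<-function(){
  for(i in seq_along(snames)){
    colnames<-unique(saptemplate[saptemplate$`Sheet Name`==snames[i],]$Header)
    # text = "" yields a zero-row data frame with the requested column names
    df<-read.table(text = "", col.names = colnames)
    # Assign in the global environment so the data frame outlives the function call
    assign(snames[i], df, envir = .GlobalEnv)
  }
}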
do.call(file.remove, list(list.files("./projects/errors/mandatory/", full.names = TRUE)))
do.call(file.remove, list(list.files("./projects/errors/codelist/", full.names = TRUE)))
do.call(file.remove, list(list.files("./projects/errors/length/", full.names = TRUE)))
do.call(file.remove, list(list.files("./projects/summary/", full.names = TRUE)))
do.call(file.remove, list(list.files("./projects/output/", full.names = TRUE)))
```


## Data transformation workflow

The following is the proposed preliminary workflow for the data transformation project.


>All files of a segment (contacts, accounts, etc.) should be inside the relevant folder. Each segment folder should contain one folder for all codelist files. All legacy data (one file per country) should be inside the `raw-data` folder, with each file named after its country. A further file containing the field definitions, including the name of the matching column in the legacy file, should also be present.

>*Make sure that there are no hidden files inside the directory.*
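
The hidden-file check can also be scripted. The following is a minimal sketch, not part of the original workflow: `find_hidden_files()` is a hypothetical helper, and it assumes that "hidden" means dot-files (such as `.DS_Store`) and Office lock files (such as `~$DE.xlsx`).

```r
# Hypothetical helper: list files that could be picked up by accident.
# Assumption: hidden files start with "." or with "~$" (Office lock files).
find_hidden_files <- function(dir) {
  all_files <- list.files(dir, all.files = TRUE, recursive = TRUE)
  all_files[grepl("^(\\.|~\\$)", basename(all_files))]
}

# Example on a throwaway directory (not the real project folders)
demo_dir <- file.path(tempdir(), "hidden-demo")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, c("DE.xlsx", "~$DE.xlsx", ".DS_Store")))

hidden <- find_hidden_files(demo_dir)
print(hidden)
```

Running it against `./projects` before knitting should return an empty vector if the folders are clean.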

### Relationship files

```{r echo=TRUE, message=FALSE, warning=FALSE}
# `pattern` is a regular expression, not a glob: match files ending in .xls or .xlsx
relfilenames <- list.files("./projects/relationship", pattern = "\\.xlsx?$", full.names = TRUE)

print(relfilenames)
rel_files<-NULL
for(i in seq_along(relfilenames)){
  rel_files[[i]]<-read_excel(path = relfilenames[[i]], sheet = 1) 
  }

names(rel_files)<-gsub("./projects/relationship/R","",relfilenames)
# Names of the files imported
names(rel_files)
```


### Code Lists


```{r Create List of Files, echo=TRUE, message=FALSE, warning=FALSE}

# A separate directory for code lists could be avoided, but organizing the files may then be difficult.
# This can be explored further if we want to transform all the data in one go, i.e. not by function (contacts, accounts etc.).
# `pattern` is a regular expression, not a glob: match files ending in .xlsx
filenames <- list.files("./projects/CodeList", pattern = "\\.xlsx$", full.names = TRUE)

# File paths
print(filenames)
```


Manually check that the list above includes all the codelist files.
If it is correct, read the files.

```{r codelistreader, echo=TRUE, message=FALSE, warning=FALSE}
sheet_names <- lapply(filenames, excel_sheets) # Creates a list of the sheet names
codelist_files <- NULL
for (i in seq_along(filenames)) {
  a <- lapply(excel_sheets(filenames[[i]]), read_excel, path = filenames[[i]], col_types = "text") # Reads the sheets of the excel files
  names(a) <- c(sheet_names[[i]]) # Renames them according to the sheet names extracted above
  codelist_files <- c(codelist_files, a)
}
# Names of the files imported
names(codelist_files)
# codelist_files<-unique(codelist_files)

```
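
The transformation chunk further down looks up each codelist column by name in `codelist_files`, so a missing sheet only surfaces as an error in the middle of the run. The following is a minimal sketch of an up-front check. `missing_codelists()` is a hypothetical helper and the values are toy data, not taken from the real template.

```r
# Hypothetical helper: template columns whose codelist sheet was not imported
missing_codelists <- function(template_headers, imported_sheets) {
  setdiff(template_headers, imported_sheets)
}

# Toy values for illustration only
template_headers <- c("Currency", "Project_Country", "Sales_Phase")
imported_sheets  <- c("Currency", "Project_Country")

missing <- missing_codelists(template_headers, imported_sheets)
print(missing)
```

With the real objects, the first argument would be the `Header` values of the template rows that have a `CodeList File Path`, and the second would be `names(codelist_files)`.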


### Templates


Let us now extract the data. Below we are reading only one file having all data related to `Contacts` from the legacy system.

```{r readlegacyfilepath, echo=TRUE, message=FALSE, warning=FALSE}
oldfilepath<-list.files("./projects/raw-data", pattern = "\\.xlsx?$", full.names = TRUE) # Change the path if needed; `pattern` is a regular expression, not a glob
print(oldfilepath)
```

Manually check that the list matches the actual files.

```{r readlegacyfiles, echo=TRUE}

old_files<-NULL

#read_excel(path = oldfilepath[[i]], sheet = 1)
for(i in seq_along(oldfilepath)){
  old_files[[i]]<-read_excel(path = oldfilepath[[i]], sheet = 1) 
  }

names(old_files)<-gsub("./projects/raw-data/","",oldfilepath)
```


*Some errors were noticed in the legacy file: columns with similar or identical names exist.*
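
Repeated column names can be listed before the transformation runs. The following is a minimal sketch, assuming that the import step marks repeated names with a numeric suffix such as `Department...3`. `find_duplicate_headers()` is a hypothetical helper and the header vector is toy data.

```r
# Hypothetical helper: report column names that occur more than once
find_duplicate_headers <- function(col_names) {
  # Assumption: repeated names carry a trailing "..<n>" or "...<n>" suffix
  base_names <- sub("\\.{2,3}[0-9]+$", "", col_names)
  unique(base_names[duplicated(base_names)])
}

# Toy header vector for illustration only
headers <- c("Name", "Department...3", "Department...7", "City")
dups <- find_duplicate_headers(headers)
print(dups)
```

Applied to `names(old_files[[1]])`, a non-empty result points at columns that should be renamed in the legacy file before the run.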


```{r readSAPtemplate, echo=TRUE, message=FALSE, warning=FALSE}
saptemplate<-read_excel("./projects/template.xlsx", sheet = "Field_Definitions")
# First few rows of the imported data
head(saptemplate)
```


*Please note that the format of the tables (sheets) has been slightly changed. Earlier, the corresponding sheet name was mentioned in a row before the actual table. Now every row mentions the corresponding sheet name. This was done manually for convenience of data extraction.*
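
Because every row of the field definitions carries its sheet name, the headers of each target sheet can be grouped in one step. The following is a minimal sketch on a toy data frame: the column names `Sheet Name` and `Header` follow the template, while the rows are made up for illustration.

```r
# Toy field definitions; the rows are illustrative, not from the real template
template <- data.frame(
  `Sheet Name` = c("Opportunity", "Opportunity", "Opportunity_Notes"),
  Header = c("External_Key", "Name", "Text"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# One character vector of headers per sheet
headers_by_sheet <- split(template$Header, template$`Sheet Name`)
print(headers_by_sheet)
```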


```{r createmptySAPfiles, echo=TRUE, message=FALSE, warning=FALSE}
#orilo<-"en_US.UTF-8"
#Sys.setlocale(locale="en_US.UTF-8")
strt<-Sys.time()
snames <- unique(saptemplate$`Sheet Name`)

for (h in seq_along(old_files)) {
  
  # Copy original data
  old.copy <- old_files[[h]]
  print(paste0(names(old_files[h])," imported"))
  
  err.summ<-data.frame(Country=NULL, Name=NULL, Expected=NULL, Actual=NULL) #Error Cal
  # Creates data frame for each sheet in snames
  for (i in seq_along(snames)) {
    print(paste0("Processing ..",snames[i]))
    
    # Select the column names from the field description sheet
    print("Creating template")
    sel.template.desc <-
      saptemplate[saptemplate$`Sheet Name` == snames[i], ]
    print("Creating column names")
    sel.template.desc.colnames <- sel.template.desc$Header
    
    # Create a list by adding values from corresponding legacy data
    temp <- NULL
    print("adding values to template ")
    if(snames[i] %in% c("Opportunity_Competitor_Party_In", "Opportunity_EndBuyer_Contact_Pa",
                        "Opportunity_External_Party_Info","Opportunity_Installed_Object",
                        "Opportunity_Product","Opportunity_Other_Party_Informa","Opportunity_Payer_Contact_Party",
                        "Opportunity_Product_Recipient_C","Opportunity_Prospect_Contact_Pa",
                        "Opportunity_Revenue_Splits","Opportunity_Sales_Employee_Part",
                        "Opportunity_Sales_Partner_Party","Opportunity_Notes",
                        "Opportunity_Competitor_Product","Contact_Party_Information",
                        "Opportunity_Item_Party_Informat","Opportunity_Product_Quantity_Pl",
                        "Opportunity_Product_Revenue_Pla","Opportunity_Product_Notes",
                        "Opportunity_Header_Revenue_Plan", "Opportunity_Account_Team_Party_")){
      next
    }
    
    if(snames[i]=="Opportunity"){
    for (j in seq_along(sel.template.desc.colnames)) {
      print(paste("Processing ",sel.template.desc.colnames[j]))
      # Derived columns are wrapped in list() so that the whole vector is stored in
      # slot j; `temp[j] <- <vector>` would keep only its first element
      if(sel.template.desc.colnames[j]=="Expected_Value"){
        temp[j]<-list(ifelse(!is.na(old.copy$`User Provided`), old.copy$`User Provided`, old.copy$`Potential Customer`))
        next
      }
      
      if(sel.template.desc.colnames[j]=="Sales_Unit" | sel.template.desc.colnames[j]=="Sales_Organization"){
        temp[j]<-paste0(substr(names(old_files[h]), 1, 2),"01")
        next
      }
      
      if(sel.template.desc.colnames[j]=="International_project"){
        # TRUE wherever the legacy column has a value
        temp[j]<-list(!is.na(old.copy[[sel.template.desc$oldkey[j]]]))
        next
      }
      
      if(sel.template.desc.colnames[j]=="LEVIAT_specified"){
        temp[j]<-list(ifelse(!is.na(old.copy$halfenspecified),old.copy$halfenspecified, old.copy$competitor))
        next
      }
      if(sel.template.desc.colnames[j]=="Project_Country"){
        temp[j]<-list(ifelse(is.na(old.copy$Country), NA, substr(names(old_files[h]), 1, 2)))
        next
      }
      if(sel.template.desc.colnames[j]=="BIM_designed"){
        temp[j]<-list(ifelse(is.na(old.copy$`BIM designed`), "Software Unknown", old.copy$`BIM designed`))
        next
      }
      
      temp[j] <- ifelse(!is.na(sel.template.desc$default[j]), as.character(as.vector(sel.template.desc$default[j])),
                        ifelse(
                          sel.template.desc$oldkey[j]=="NA" | is.na(sel.template.desc$oldkey[j]), NA, 
                          as.vector(old.copy[, sel.template.desc$oldkey[j]])
                        )
                        )
       
    }
          # Rename the columns according to field description
    print("renaming template ")
    names(temp) <- sel.template.desc.colnames
    
    # Create data frame from the list
    df <- as.data.frame(temp)
    print("Converted to data frame")
    }
    
    
   if(snames[i]=="Opportunity_Preceding_and_Follo"){
     
     old.copy.f<-old.copy |> filter(`Project hierarchy`=="Opportunity")
     if(nrow(old.copy.f)==0){next} #If not opportunity found in data go to next loop
      for (j in seq_along(sel.template.desc.colnames)) {
        temp[j] <- ifelse(!is.na(sel.template.desc$default[j]), as.character(as.vector(sel.template.desc$default[j])),
                        ifelse(
                          sel.template.desc$oldkey[j]=="NA" | is.na(sel.template.desc$oldkey[j]), NA, 
                          as.vector(old.copy.f[, sel.template.desc$oldkey[j]])
                        )
                        )
      }
      
      
    # Rename the columns according to field description
    print("renaming template ")
    names(temp) <- sel.template.desc.colnames
    
    # Create data frame from the list
    df <- as.data.frame(temp)
    print("Converted to data frame")
    
    corr.seq<-colnames(df) # preserving sequence name seq is not maintained post join
    
    df<-df |> 
      mutate(Opportunity_External_Key=str_sub(Reference_Doc_External_Key,1,str_length(Reference_Doc_External_Key)-4)) |> 
      mutate(External_Key=paste("OPF",Opportunity_External_Key,Reference_Doc_External_Key, sep = "_")) |> select(all_of(corr.seq))
        
    }   
    
    if(snames[i]=="Opportunity_Party_Information"){
     rdf<-rel_files[[names(old_files[h])]]
     if(is.null(rdf)){next} #If not data found loop
      for (j in seq_along(sel.template.desc.colnames)) {
        temp[j] <- ifelse(!is.na(sel.template.desc$default[j]), as.character(as.vector(sel.template.desc$default[j])),
                        ifelse(
                          sel.template.desc$oldkey[j]=="NA" | is.na(sel.template.desc$oldkey[j]), NA, 
                          as.vector(rdf[, sel.template.desc$oldkey[j]])
                        )
                        )
      }
      
    
    # Rename the columns according to field description
    print("renaming template ")
    names(temp) <- sel.template.desc.colnames
    
    # Create data frame from the list
    df <- as.data.frame(temp)
    print("Converted to data frame")
    
    corr.seq<-colnames(df) # preserving sequence name seq is not maintained post join
    
    df<-df |> 
      mutate(External_Key=paste("INV",Opportunity_External_Key,Party_ID,Role,Party_External_Key,sep="_")) |> select(all_of(corr.seq))
    }

    if(snames[i]=="Opportunity_Sales_Team_Party_In"){
     
      for (j in seq_along(sel.template.desc.colnames)) {
        temp[j] <- ifelse(!is.na(sel.template.desc$default[j]), as.character(as.vector(sel.template.desc$default[j])),
                        ifelse(
                          sel.template.desc$oldkey[j]=="NA" | is.na(sel.template.desc$oldkey[j]), NA, 
                          as.vector(old.copy[, sel.template.desc$oldkey[j]])
                        )
                        )
      }
      
      
    # Rename the columns according to field description
    print("renaming template ")
    names(temp) <- sel.template.desc.colnames
    
    # Create data frame from the list
    df <- as.data.frame(temp)
    print("Converted to data frame")
    
    corr.seq<-colnames(df) # preserving sequence name seq is not maintained post join
    #if(names(old_files[h])=="DE.xls"){stop()}
    df<-df |> mutate(resp=old.copy$Responsible, apptech=old.copy$`application technology`, backoff=old.copy$`Back office`, 
             pres=old.copy$Presales) |>
      #mutate(resp=paste0(resp,"_resp"), apptech=paste0(apptech,"_apptech"), backoff=paste0(backoff,"_backoff")) |> 
      pivot_longer(cols = c(resp, apptech, backoff, pres)) |> 
  filter(!is.na(value)) |> 
      select(-c(Party_ID,Role)) |> 
  rename(Party_ID=value) |> 
  rename(Role=name) |> 
    mutate(Role=ifelse(Role=="resp","39", ifelse(Role=="apptech", "ZIN016", ifelse(Role=="backoff", "ZIN002","ZIN011")))) |> 
      mutate(External_Key=paste("PAR",Opportunity_External_Key,Party_ID,Role, sep="_")) |> 
  select(all_of(corr.seq))
        
    }
    
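    # Error summary file
    Expected<-nrow(df)
    # Defined here so that they exist even when a sheet has no mandatory columns
    Country<-substr(names(old_files[h]), 1, 2)
    Name<-snames[i]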
    
    #Select essential rows
    print("Identifying essential rows")
    sel.template.desc |>
      filter(Mandatory == "Yes") |>
      pull(Header) -> essential.columns
    
    error.mandatory <- NULL
    error.df<-data.frame(Country=NULL, Name=NULL, Rows=NULL, Expected=NULL)
    # Operate on essential columns including creation of error file
    for (k in seq_along(essential.columns)) {
      

      print("Creating and writing data with missing mandatory values")
      manerrdt<-df[is.na(df[, essential.columns[k]]), ]
      if(nrow(manerrdt)>0){
        manerrdt<-manerrdt |> mutate(error=paste0(essential.columns[k]," missing"))
      }
      assign(
        paste0(
          "error_mandatory_",
          substr(names(old_files[h]), 1, 2),
          "_",
          snames[i],
          "_",
          essential.columns[k]
        ),
        manerrdt
      )
      # TO be saved in error files
      
      if(nrow(manerrdt)>0){
              write.csv(
        manerrdt,
        paste0(
          "./projects/errors/mandatory/",
          substr(names(old_files[h]), 1, 2),
          "_",
          snames[i],
          "_",
          essential.columns[k],
          "_error_mandatory.csv"
        ), row.names = F, na=""
      )
      }
      # Error summary file
      Country<-substr(names(old_files[h]), 1, 2)
      Name<-snames[i]
      err.type<-paste0("Missing ",essential.columns[k])
      err.count<-nrow(df[is.na(df[, essential.columns[k]]), ])

      
      print("Removing rows with empty essential columns")
      df <- df[!is.na(df[, essential.columns[k]]), ]
      if(err.count>0){
        error.df<-rbind(error.df,data.frame(Country=Country, Name=Name, err.type=err.type, err.count=err.count)) #Error cal
      }
    }
    
    print("Identifying columns associated with codelists")
    # List of columns that have a codelist
    codelistcols <- sel.template.desc |>
      filter(!is.na(`CodeList File Path`)) |> pull(Header)
    for (k in seq_along(codelistcols)) {
      #   if(codelistcols[k]=="Currency"){
      #     print("Found Currency. Adding 0.")
      #   df$International_Version<-"CHF"
      # }
      
      print(paste0("Identifying errors ",codelistcols[k]))
      def.rows <-
        which(!df[, codelistcols[k]] %in% c(pull(codelist_files[codelistcols[k]][[1]], Description), NA))
      def.n<- df[def.rows, 1]
      def.rows.val <-
        df[!df[, codelistcols[k]] %in% c(pull(codelist_files[codelistcols[k]][[1]], Description), NA), codelistcols[k]]
      def.colname <- rep(codelistcols[k],length.out = length(def.rows))
      def <- data.frame(def.rows, def.n,def.rows.val,def.colname)
      if(nrow(def)>0){
              assign(paste0(
        "error_codematch_",
        substr(names(old_files[h]), 1, 2),
        "_",
        snames[i],
        "_",
        codelistcols[k]
      ),
      def) # TO be saved
        write.csv(
        def,
        paste0(
          "./projects/errors/codelist/",
          substr(names(old_files[h]), 1, 2),
          "_",
          snames[i],
          "_",
          codelistcols[k],
          "_error_codematch_.csv"
        ), row.names = F, na=""
      )
      }
      err.type<-paste0("Codelist Mismatch ", codelistcols[k]) #Error cal
      err.count<-nrow(def) #Error cal
            if(err.count>0){
        error.df<-rbind(error.df,data.frame(Country=Country, Name=Name, err.type=err.type, err.count=err.count)) #Error cal
      }
      

      print(paste0("Removing errors ",codelistcols[k]))
      # Removes any mismatch
      df[!df[, codelistcols[k]] %in% c(pull(codelist_files[codelistcols[k]][[1]], Description), NA), codelistcols[k]] <-
        NA
      
      # Matches each column with the corresponding code list and returns the value
      df[, codelistcols[k]] <-
        pull(codelist_files[codelistcols[k]][[1]], 2)[match(pull(df, codelistcols[k]),
                                                            pull(codelist_files[codelistcols[k]][[1]], Description))]
      
    }
    max.length <- as.numeric(sel.template.desc$`Max Length`)
    dtype <- sel.template.desc$`Data Type`
    rowval <- NULL
    ival <- NULL
    rval <- NULL
    lenght.issue.df <- NULL
    # Changing the data class
    for (k in 1:ncol(df)) {
      
      if (dtype[k] == "String") {
        df[, k] <- as.character(pull(df, k))
      }
      if (dtype[k] == "Boolean") {
        df[, k] <- as.logical(pull(df, k))
      }
      if (dtype[k] == "DateTime") {
        df[, k] <- lubridate::ymd(pull(df, k))
      }
      if (dtype[k] == "Time") {
        df[, k] <- lubridate::hms(pull(df, k))
        
      } # This list will increase and also change based on input date and time formats
      
      
    }
    
    # print("Rectifying streetname")
    # # Street and House Number
    # if (any(colnames(df) == "Street")) {
    # print("found steet")
    #   # stop()
    #   
    #   df$Streetname<-NA
    #   df$HouseNumber<-NA
    #   #df |> extract("Street", "(\\D+)(\\d.*)")
    #   df<-tidyr::extract(df,
    #           "Street",
    #           c("Streetname", "HouseNumber"),
    #           "(\\D+)(\\d.*)")
    #   df <- df |>
    #     select(-c("House_Number")) |>
    #     rename(Street = Streetname, House_Number = HouseNumber) |>
    #     select(all_of(sel.template.desc.colnames))
    # }
    
    # Length Rectification
    colclasses <- lapply(df, class)
    print("Rectifying Length")
    for (k in seq_len(ncol(df))) {
      # Only character columns are checked. The bookkeeping stays inside this branch
      # so that values from a previous column are never recorded a second time
      if (is.character(df[[k]])) {
        print("found character column ")
        rowval <- pull(df, 1)
        ival <- ifelse(nchar(pull(df, k))== 0 | is.na(nchar(pull(df, k))),1,nchar(pull(df, k)))
        rval <- max.length[k]
        colval <- pull(df, k)
        colnm<-colnames(df)[k]
        cntr<-substr(names(old_files[h]), 1, 2)
        # rectifying data length
        df[, k] <-
          ifelse(nchar(pull(df, k)) > max.length[k],
                 substring(pull(df, k), 1, max.length[k]),
                 pull(df, k))
        
        lenght.issue.df <-
          rbind(lenght.issue.df, data.frame(rowval, ival, rval, colnm, colval,cntr))
        
        err.type<- paste0("Length error ", colnames(df)[k]) # Error cal
        err.count<- sum(ival>rval, na.rm = T) # Error cal
        if(err.count>0){
          error.df<-rbind(error.df,data.frame(Country=Country, Name=Name, err.type=err.type, err.count=err.count)) #Error cal
        }
      }
    }
    
    # Keep only the values that exceeded the maximum length (skipped if no character column was found)
    if(!is.null(lenght.issue.df)){
      lenght.issue.df <- dplyr::filter(lenght.issue.df,ival>rval)
      
      if(nrow(lenght.issue.df)>0){
        write.csv(lenght.issue.df,
                  paste0(
                    "./projects/errors/length/",
                    substr(names(old_files[h]), 1, 2),
                    "_",
                    snames[i],
                    "_length_error.csv"
                  ), row.names = F, na="")
      }
    }

    assign(snames[i], df)
    write.csv(df,paste0("./projects/output/", substr(names(old_files[h]), 1, 2), "_", snames[i],".csv"), row.names = F, na="")
    if(nrow(error.df)>0){
    write.csv(error.df, paste0("./projects/summary/",substr(names(old_files[h]), 1, 2), "_", snames[i],"_error",".csv"), row.names = F, na="") # Error write
    }
    err.summ<-rbind(err.summ,data.frame(Country=Country, Name=Name, Expected=Expected, Actual=nrow(df))) #Error Cal
    
  }
  write.csv(err.summ,
              paste0("./projects/summary/" ,substr(names(old_files[h]), 1, 2), "_", snames[i],"_sumerror",".csv"), row.names = F, na="") # Error Write
}

end<-Sys.time()

end-strt


```
*The code failed because the `Department` column appears several times in the data, and while importing, R renamed the duplicates (to `Department..xx`).*
*Manually verify that these are the required templates.*
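
The codelist step of the chunk above can be tried in isolation. The following is a minimal sketch on toy data: it reports values that are not in the codelist and then translates descriptions into codes with `match()`. The column name `Description` follows the codelist files, while the `Code` column name and all values are assumptions made for illustration.

```r
# Toy codelist; the Code column name and all values are illustrative assumptions
codelist <- data.frame(
  Description = c("Swiss Franc", "Euro"),
  Code = c("CHF", "EUR"),
  stringsAsFactors = FALSE
)
values <- c("Euro", "Dollar", NA, "Swiss Franc")

# Values that are neither NA nor listed in the codelist are reported as errors
is_mismatch <- !values %in% c(codelist$Description, NA)
mismatches <- values[is_mismatch]

# match() returns NA for anything it cannot find, so mismatches become NA codes
codes <- codelist$Code[match(values, codelist$Description)]

print(mismatches)
print(codes)
```

Missing values are deliberately not counted as mismatches here, which mirrors the `NA` entry added to the comparison set in the chunk above.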


```{r}
# Combine the per-country output files of one sheet into a single file
combine_outputs <- function(sheet, outfile) {
  # `pattern` is a regular expression, not a glob: match files ending in "_<sheet>.csv"
  opfilepath <- list.files("./projects/output", pattern = paste0("_", sheet, "\\.csv$"), full.names = TRUE)
  opfiles <- lapply(opfilepath, read.csv)
  opdf <- do.call(rbind.data.frame, opfiles)
  write.csv(opdf, file.path("./projects/output/combined", outfile))
}

combine_outputs("Opportunity", "combinedopportunity.csv")
combine_outputs("Opportunity_Party_Information", "combinedopportunitypartyinfo.csv")
combine_outputs("Opportunity_Preceding_and_Follo", "combinedopportunityprecedingfollo.csv")
combine_outputs("Opportunity_Sales_Team_Party_In", "combinedopportunitysalesteampartyin.csv")


```