322 changed files with 125109 additions and 0 deletions
Binary file not shown.
@ -0,0 +1,365 @@ |
|||
--- |
|||
title: "Accounts" |
|||
author: "Scary Scarecrow" |
|||
date: "1/10/2022" |
|||
output: html_document |
|||
--- |
|||
|
|||
```{r setup, include=FALSE} |
|||
knitr::opts_chunk$set(echo = TRUE) |
|||
library(readxl) |
|||
library(dplyr) |
|||
library(lubridate) |
|||
library(DT) |
|||
library(tidyr) |
|||
|
|||
mutlstxlrdr<-function(){ |
|||
for( i in seq_along(snames)){ |
|||
colnames<-unique(saptemplate[saptemplate$`Sheet Name`==snames[i],]$Header) |
|||
df<-read.table("", col.names = colnames) |
|||
assign(snames[i], df) |
|||
|
|||
} |
|||
} |
|||
``` |
|||
|
|||
|
|||
## Data transformation workflow |
|||
|
|||
Following is the proposed preliminary workflow for the data transformation project. |
|||
|
|||
|
|||
>All files of a segment (contacts, accounts etc.) should be inside the relevant folder. Each segment folder should contain one folder for all codelist files. All legacy data (one file per country, named after the country) should be inside the raw-data folder. A further file with the field definitions, including the name of the matching column in the legacy file, should also be there. |
|||
|
|||
>*Make sure that there are no hidden files inside the directory.* |
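A minimal sketch of how the hidden-file check could be automated. The helper name `find_hidden_files` and the demo folder are assumptions made for illustration; they are not part of the original workflow.

```r
# Hypothetical helper: return files whose name starts with "." under `dir`.
find_hidden_files <- function(dir) {
  all_files <- list.files(dir, all.files = TRUE, recursive = TRUE)
  all_files[startsWith(basename(all_files), ".")]
}

# Demonstration on a throwaway folder that mimics the layout described above
demo_dir <- file.path(tempdir(), "accounts_demo")
dir.create(file.path(demo_dir, "raw-data"), recursive = TRUE, showWarnings = FALSE)
invisible(file.create(file.path(demo_dir, "raw-data", "CH.xlsx")))
invisible(file.create(file.path(demo_dir, "raw-data", ".DS_Store")))

hidden <- find_hidden_files(demo_dir)
print(hidden)
```

An empty result means the folder is safe to scan with `list.files()`. Note that this only catches dot-files; temporary lock files created by spreadsheet software may need a separate check.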
|||
|
|||
|
|||
### Code Lists |
|||
|
|||
|
|||
```{r Create List of Files, echo=TRUE, message=FALSE, warning=FALSE} |
|||
|
|||
filenames <- list.files("./accounts/CodeList", pattern="\\.xlsx$", full.names = TRUE) # `pattern` is a regular expression, not a glob. A separate directory for code lists could be avoided, but organizing the files may then be difficult. This can be explored further if we want to transform all the data in one go, i.e. not by function (contacts, accounts etc.). |
|||
|
|||
# File paths |
|||
print(filenames) |
|||
``` |
|||
|
|||
|
|||
Check manually that the above list includes all the codelist files. |
|||
If correct, then read the files. |
|||
|
|||
```{r codelistreader, echo=TRUE, message=FALSE, warning=FALSE} |
|||
sheet_names<-lapply(filenames, excel_sheets) # Creates a list of the sheet names |
|||
codelist_files<-NULL |
|||
for(i in seq_along(filenames)){ |
|||
a<-lapply(excel_sheets(filenames[[i]]), read_excel, path = filenames[[i]], col_types = "text") # Reads the sheets of the excel files |
|||
names(a)<-c(sheet_names[[i]]) # Renames them according to the sheet names extracted above |
|||
codelist_files<-c(codelist_files,a) |
|||
} |
|||
# Names of the files imported |
|||
names(codelist_files) |
|||
#codelist_files<-unique(codelist_files) |
|||
codelist_files$Customer_type_I |
|||
``` |
|||
|
|||
|
|||
|
|||
### Templates |
|||
|
|||
|
|||
Let us now extract the data. Below we read the legacy-system files (one per country) that hold all data related to `Accounts`. |
|||
|
|||
```{r readlegacyfilepath, echo=TRUE, message=FALSE, warning=FALSE} |
|||
oldfilepath<-list.files("./accounts/raw-data", pattern="\\.xlsx?$", full.names = TRUE) # Change the path; the pattern is a regular expression matching .xls and .xlsx |
|||
print(oldfilepath) |
|||
``` |
|||
|
|||
Check manually that the list matches the actual files. |
|||
|
|||
```{r readlegacyfiles, echo=TRUE} |
|||
|
|||
old_files<-NULL |
|||
|
|||
#read_excel(path = oldfilepath[[i]], sheet = 1) |
|||
for(i in seq_along(oldfilepath)){ |
|||
old_files[[i]]<-read_excel(path = oldfilepath[[i]], sheet = 1) |
|||
} |
|||
old_files |
|||
names(old_files)<-gsub("./accounts/raw-data/","",oldfilepath) # Change path |
|||
``` |
|||
|
|||
|
|||
*Some errors were noticed in the legacy file: columns with similar or identical names exist.* |
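A small sketch of how such columns could be flagged before the transformation runs. The column names below are invented for illustration, and treating names that differ only in case or surrounding whitespace as "similar" is an assumption.

```r
# Invented legacy header, including an exact duplicate and a near duplicate
legacy_cols <- c("Name", "Department", "City", "Department", "department ")

# Normalise case and whitespace so near duplicates collapse to the same key
normalised <- tolower(trimws(legacy_cols))

# Flag every member of a duplicated group, not only the later occurrences
is_dup <- duplicated(normalised) | duplicated(normalised, fromLast = TRUE)
dup_cols <- legacy_cols[is_dup]
print(dup_cols)
```

Running this on `names(old_files[[h]])` before the main loop would give an early warning instead of a failure halfway through.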
|||
|
|||
|
|||
|
|||
```{r readSAPtemplate, echo=TRUE, message=FALSE, warning=FALSE} |
|||
saptemplate<-read_excel("./accounts/template.xlsx", sheet = "Field_Definitions") |
|||
# First few rows of the imported data |
|||
head(saptemplate) |
|||
``` |
|||
|
|||
|
|||
*Please note that the format of the tables (sheets) has been slightly changed. Earlier, the corresponding sheet name was given in a row before the actual table. Now every row mentions the corresponding sheet name. This was done manually for convenience of data extraction.* |
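The manual reshaping described in the note could, under some assumptions, be done in code. The sketch below assumes that in the old layout a title row carries only the sheet name (here in the `Header` column) and leaves the other fields empty; the sheet and field names are made up.

```r
# Assumed old layout: a title row with only the sheet name precedes each table
raw <- data.frame(
  Header    = c("Account", "AccountID", "Name", "AccountAddress", "Street"),
  Mandatory = c(NA, "Yes", "Yes", NA, "No"),
  stringsAsFactors = FALSE
)

# Assumption: title rows are exactly the rows without a Mandatory value
is_title <- is.na(raw$Mandatory)

# Carry each title down to the rows that follow it
sheet <- raw$Header[is_title][cumsum(is_title)]

tidy <- data.frame(`Sheet Name` = sheet, raw,
                   check.names = FALSE, stringsAsFactors = FALSE)[!is_title, ]
print(tidy)
```

The result has one row per field with a `Sheet Name` column, which is the shape the transformation chunk expects when it filters on `Sheet Name`.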
|||
|
|||
|
|||
## Don't have Status column defined |
|||
## There could be an issue in line of business |
|||
```{r createmptySAPfiles, echo=TRUE, message=FALSE, warning=FALSE} |
|||
#orilo<-"en_US.UTF-8" |
|||
#Sys.setlocale(locale="en_US.UTF-8") |
|||
strt<-Sys.time() |
|||
snames <- unique(saptemplate$`Sheet Name`) |
|||
|
|||
for (h in seq_along(old_files)) { |
|||
|
|||
|
|||
# Copy original data |
|||
old.copy <- old_files[[h]] |
|||
print(paste0(names(old_files[h])," imported")) |
|||
err.summ<-data.frame(Country=NULL, Name=NULL, Expected=NULL, Actual=NULL) #Error Cal |
|||
# Creates data frame for each sheet in snames |
|||
for (i in seq_along(snames)) { |
|||
print(paste0("Processing ..",snames[i])) |
|||
|
|||
# Select the column names from the field description sheet |
|||
print("Creating template") |
|||
sel.template.desc <- |
|||
saptemplate[saptemplate$`Sheet Name` == snames[i], ] |
|||
print("Creating column names") |
|||
sel.template.desc.colnames <- sel.template.desc$Header |
|||
|
|||
# Create a list by adding values from corresponding legacy data |
|||
temp <- NULL |
|||
print("adding values to template ") |
|||
for (j in seq_along(sel.template.desc.colnames)) { |
|||
temp[j] <-ifelse(sel.template.desc$oldkey[j]=="NA" | is.na(sel.template.desc$oldkey[j]), |
|||
NA,as.vector(old.copy[, sel.template.desc$oldkey[j]]) |
|||
) |
|||
|
|||
} |
|||
|
|||
# Rename the columns according to field description |
|||
print("renaming template ") |
|||
names(temp) <- sel.template.desc.colnames |
|||
|
|||
# Create data frame from the list |
|||
df <- as.data.frame(temp) |
|||
print("Converted to data frame") |
|||
|
|||
# Error summary file |
|||
Expected<-nrow(df) |
|||
|
|||
#Select essential rows |
|||
print("Identifying essential rows") |
|||
sel.template.desc |> |
|||
filter(Mandatory == "Yes") |> |
|||
pull(Header) -> essential.columns |
|||
|
|||
error.mandatory <- NULL |
|||
error.df<-data.frame(Country=NULL, Name=NULL, Rows=NULL, Expected=NULL) |
|||
|
|||
# Operate on essential columns including creation of error file |
|||
for (k in seq_along(essential.columns)) { # In case there are any default values (of mandatory) they need to be added here |
|||
if(essential.columns[k]=="International_Version"){ |
|||
print("Found International Version. Adding 0.") |
|||
df$International_Version<-"0" |
|||
} |
|||
|
|||
print("Creating and writing data with missing mandatory values") |
|||
assign( |
|||
paste0( |
|||
"error_mandatory_", |
|||
substr(names(old_files[h]), 2, 3), |
|||
"_", |
|||
snames[i], |
|||
"_", |
|||
essential.columns[k] |
|||
), |
|||
df[is.na(df[, essential.columns[k]]), ] |
|||
) |
|||
# TO be saved in error files |
|||
|
|||
if(nrow(df[is.na(df[, essential.columns[k]]), ])>0){ |
|||
write.csv( |
|||
df[is.na(df[, essential.columns[k]]), ], |
|||
paste0( |
|||
"./accounts/errors/mandatory/", #Change path |
|||
substr(names(old_files[h]), 2, 3), |
|||
"_", |
|||
snames[i], |
|||
"_", |
|||
essential.columns[k], |
|||
"_error_mandatory.csv" |
|||
), row.names = F, na="" |
|||
) |
|||
} |
|||
# Error summary file |
|||
Country<-substr(names(old_files[h]), 2, 3) |
|||
Name<-snames[i] |
|||
err.type<-paste0("Missing ",essential.columns[k]) |
|||
err.count<-nrow(df[is.na(df[, essential.columns[k]]), ]) |
|||
|
|||
print("Removing rows with empty essential columns") |
|||
df <- df[!is.na(df[, essential.columns[k]]), ] |
|||
if(err.count>0){ |
|||
error.df<-rbind(error.df,data.frame(Country=Country, Name=Name, err.type=err.type, err.count=err.count)) #Error cal |
|||
} |
|||
} |
|||
|
|||
print("Identifying columns associated with codelists") |
|||
# List of columns that have a codelist |
|||
codelistcols <- sel.template.desc |> |
|||
filter(!is.na(`CodeList File Path`)) |> pull(Header) |
|||
|
|||
for (k in seq_along(codelistcols)) { |
|||
|
|||
print(paste0("Identifying errors ",codelistcols[k])) |
|||
def.rows <- |
|||
which(!df[, codelistcols[k]] %in% c(pull(codelist_files[codelistcols[k]][[1]], Description), NA)) |
|||
def.n<- df[def.rows, 1] |
|||
def.rows.val <- |
|||
df[!df[, codelistcols[k]] %in% c(pull(codelist_files[codelistcols[k]][[1]], Description), NA), codelistcols[k]] |
|||
def <- data.frame(def.rows, def.n,def.rows.val) |
|||
if(nrow(def)>0){ |
|||
assign(paste0( |
|||
"error_codematch_", |
|||
substr(names(old_files[h]), 2, 3), |
|||
"_", |
|||
snames[i], |
|||
"_", |
|||
codelistcols[k] |
|||
), |
|||
def) # TO be saved |
|||
write.csv( |
|||
def, |
|||
paste0( |
|||
"./accounts/errors/codelist/", #Change path |
|||
substr(names(old_files[h]), 2, 3), |
|||
"_", |
|||
snames[i], |
|||
"_", |
|||
codelistcols[k], |
|||
"_error_codematch_.csv" |
|||
), row.names = F, na="" |
|||
) |
|||
} |
|||
err.type<-paste0("Codelist Mismatch ", codelistcols[k]) #Error cal |
|||
err.count<-nrow(def) #Error cal |
|||
if(err.count>0){ |
|||
error.df<-rbind(error.df,data.frame(Country=Country, Name=Name, err.type=err.type, err.count=err.count)) #Error cal |
|||
} |
|||
|
|||
print(paste0("Removing errors ",codelistcols[k])) |
|||
# Removes any mismatch |
|||
df[!df[, codelistcols[k]] %in% c(pull(codelist_files[codelistcols[k]][[1]], Description), NA), codelistcols[k]] <- |
|||
NA |
|||
|
|||
# Matches each column with the corresponding code list and returns the value |
|||
df[, codelistcols[k]] <- |
|||
pull(codelist_files[codelistcols[k]][[1]], 2)[match(pull(df, codelistcols[k]), |
|||
pull(codelist_files[codelistcols[k]][[1]], Description))] |
|||
|
|||
} |
|||
max.length <- as.numeric(sel.template.desc$`Max Length`) |
|||
dtype <- sel.template.desc$`Data Type` |
|||
rowval <- NULL |
|||
ival <- NULL |
|||
rval <- NULL |
|||
lenght.issue.df <- NULL |
|||
# Changing the data class |
|||
for (k in 1:ncol(df)) { |
|||
|
|||
if (dtype[k] == "String") { |
|||
df[, k] <- as.character(pull(df, k)) |
|||
} |
|||
if (dtype[k] == "Boolean") { |
|||
df[, k] <- as.logical(pull(df, k)) |
|||
} |
|||
if (dtype[k] == "DateTime") { |
|||
df[, k] <- lubridate::ymd_hms(pull(df, k)) |
|||
} |
|||
if (dtype[k] == "Time") { |
|||
df[, k] <- lubridate::hms(pull(df, k)) |
|||
|
|||
} # This list will increase and also change based on input date and time formats |
|||
|
|||
|
|||
|
|||
} |
|||
|
|||
print("Rectifying streetname") |
|||
# Street and House Number |
|||
if (any(colnames(df) == "Street")) { |
|||
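# Separate street name and house number. The result must be assigned back, |
|||
# otherwise the new columns stay empty. Rows without a digit yield NA in both. |
|||
df <- extract(df, |
|||
"Street", |
|||
c("Streetname", "HouseNumber"), |
|||
"(\\D+)(\\d.*)", |
|||
remove = FALSE) # keep Street so that the select() below can drop it |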
|||
df <- df |> |
|||
select(-c("Street", "House_Number")) |> |
|||
rename(Street = Streetname, House_Number = HouseNumber) |> |
|||
select(sel.template.desc.colnames) |
|||
} |
|||
|
|||
# Length Rectification |
|||
colclasses <- lapply(df, class) |
|||
print("Rectifying Length") |
|||
for (k in 1:ncol(df)) { |
|||
if (colclasses[[k]] == "character") { |
|||
print("found character column ") |
|||
rowval <- pull(df, 1) |
|||
ival <- ifelse(nchar(pull(df, k))== 0 | is.na(nchar(pull(df, k))),1,nchar(pull(df, k))) |
|||
rval <- max.length[k] |
|||
# rectifying data length |
|||
df[, k] <- |
|||
ifelse(nchar(pull(df, k)) > max.length[k], |
|||
substring(pull(df, k), 1, max.length[k]), |
|||
pull(df, k)) |
|||
# Record length issues only for character columns, so that values left over |
|||
# from a previous column are not counted again |
|||
lenght.issue.df <- |
|||
rbind(lenght.issue.df, data.frame(rowval, ival, rval)) |
|||
err.type<- paste0("Length error ", colnames(df)[k]) # Error cal |
|||
err.count<- sum(ival>rval, na.rm = T) # Error cal |
|||
if(err.count>0){ |
|||
error.df<-rbind(error.df,data.frame(Country=Country, Name=Name, err.type=err.type, err.count=err.count)) #Error cal |
|||
} |
|||
} |
|||
} |
|||
|
|||
lenght.issue.df <- dplyr::filter(lenght.issue.df,ival>rval) |
|||
if(nrow(lenght.issue.df)>0){ |
|||
write.csv(lenght.issue.df, |
|||
paste0( |
|||
"./accounts/errors/length/", # Change path |
|||
substr(names(old_files[h]), 2, 3), |
|||
"_", |
|||
snames[i], |
|||
"_length_error.csv" |
|||
), row.names = F, na="") |
|||
} |
|||
|
|||
assign(snames[i], df) |
|||
|
|||
write.csv(df,paste0("./accounts/output/", substr(names(old_files[h]), 2, 3), "_", snames[i],".csv"), row.names = F, na="") #Change path |
|||
if(nrow(error.df)>0){ |
|||
write.csv(error.df, paste0("./accounts/summary/",substr(names(old_files[h]), 2, 3), "_", snames[i],"_error",".csv"), row.names = F, na="") # Error write |
|||
} |
|||
err.summ<-rbind(err.summ,data.frame(Country=Country, Name=Name, Expected=Expected, Actual=nrow(df))) #Error Cal |
|||
} |
|||
write.csv(err.summ, |
|||
paste0("./accounts/summary/" ,substr(names(old_files[h]), 2, 3), "_", snames[i],"_sumerror",".csv"), row.names = F, na="") # Error Write |
|||
} |
|||
|
|||
end<-Sys.time() |
|||
|
|||
end-strt |
|||
|
|||
|
|||
``` |
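The codelist step in the chunk above does two things: it flags values that are not in the codelist, then replaces each description with its code. A stripped-down sketch of that idea on invented data follows. The codelist contents are made up, and placing the code in the second column mirrors the `pull(..., 2)` call but is otherwise an assumption.

```r
# Invented codelist: description in column 1, code in column 2
codelist <- data.frame(
  Description = c("Hospital", "Clinic", "Pharmacy"),
  Code        = c("001", "002", "003"),
  stringsAsFactors = FALSE
)

# Invented legacy values; "Lab" has no codelist entry, NA is simply missing
values <- c("Clinic", "Hospital", "Lab", NA, "Pharmacy")

# Rows with a value that is neither in the codelist nor missing
mismatch_rows <- which(!values %in% c(codelist$Description, NA))

# Translate descriptions to codes; unmatched and missing values become NA
codes <- codelist$Code[match(values, codelist$Description)]

print(mismatch_rows)
print(codes)
```

Because `match()` already returns `NA` for unmatched values, the separate "set mismatches to NA" step is only needed for reporting, not for the translation itself.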
|||
*The code failed because the Department column appears several times in the data, and on import R renamed the duplicates (to Department..xx).* |
|||
*Manually verify if these are the required templates* |
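The street and house number step splits one address field into two. As a base R sketch of the same idea, under stated assumptions: the addresses are invented, and the pattern differs slightly from the `(\\D+)(\\d.*)` used in the chunk because it trims the space before the number and keeps addresses that have no number at all.

```r
# Invented addresses; the last one has no house number
street <- c("Bahnhofstrasse 12a", "Rue du Lac 5", "Postfach")

# Group 1: leading non-digits (lazy); group 2: everything from the first digit
pattern <- "^(\\D+?)\\s*(\\d.*)$"
has_number <- grepl(pattern, street, perl = TRUE)

# Keep the original text when no house number is present
street_name  <- ifelse(has_number, sub(pattern, "\\1", street, perl = TRUE), street)
house_number <- ifelse(has_number, sub(pattern, "\\2", street, perl = TRUE), NA_character_)

print(data.frame(street_name, house_number))
```

Whether an address without a number should be kept or set to missing is a business decision; this sketch keeps it.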
|||
|
@ -0,0 +1,402 @@ |
|||
--- |
|||
title: "Contacts" |
|||
author: "Scary Scarecrow" |
|||
date: "12/27/2021" |
|||
output: html_document |
|||
--- |
|||
|
|||
```{r setup, include=FALSE} |
|||
knitr::opts_chunk$set(echo = TRUE) |
|||
library(readxl) |
|||
library(dplyr) |
|||
library(lubridate) |
|||
library(DT) |
|||
library(tidyr) |
|||
|
|||
mutlstxlrdr<-function(){ |
|||
for( i in seq_along(snames)){ |
|||
colnames<-unique(saptemplate[saptemplate$`Sheet Name`==snames[i],]$Header) |
|||
df<-read.table("", col.names = colnames) |
|||
assign(snames[i], df) |
|||
|
|||
} |
|||
} |
|||
``` |
|||
|
|||
|
|||
## Data transformation workflow |
|||
|
|||
Following is the proposed preliminary workflow for the data transformation project. |
|||
|
|||
|
|||
>All files of a segment (contacts, accounts etc.) should be inside the relevant folder. Each segment folder should contain one folder for all codelist files. All legacy data (one file per country, named after the country) should be inside the raw-data folder. A further file with the field definitions, including the name of the matching column in the legacy file, should also be there. |
|||
|
|||
>*Make sure that there are no hidden files inside the directory.* |
|||
|
|||
|
|||
### Code Lists |
|||
|
|||
|
|||
```{r Create List of Files, echo=TRUE, message=FALSE, warning=FALSE} |
|||
|
|||
filenames <- list.files("./contacts/CodeList", pattern="\\.xlsx$", full.names = TRUE) # `pattern` is a regular expression, not a glob. A separate directory for code lists could be avoided, but organizing the files may then be difficult. This can be explored further if we want to transform all the data in one go, i.e. not by function (contacts, accounts etc.). |
|||
|
|||
# File paths |
|||
print(filenames) |
|||
``` |
|||
|
|||
|
|||
Check manually that the above list includes all the codelist files. |
|||
If correct, then read the files. |
|||
|
|||
```{r codelistreader, echo=TRUE, message=FALSE, warning=FALSE} |
|||
sheet_names<-lapply(filenames, excel_sheets) # Creates a list of the sheet names |
|||
codelist_files<-NULL |
|||
for(i in seq_along(filenames)){ |
|||
a<-lapply(excel_sheets(filenames[[i]]), read_excel, path = filenames[[i]], col_types = "text") # Reads the sheets of the excel files |
|||
names(a)<-c(sheet_names[[i]]) # Renames them according to the sheet names extracted above |
|||
codelist_files<-c(codelist_files,a) |
|||
} |
|||
# Names of the files imported |
|||
names(codelist_files) |
|||
#codelist_files<-unique(codelist_files) |
|||
codelist_files$Academic_Title |
|||
``` |
|||
|
|||
|
|||
|
|||
### Templates |
|||
|
|||
|
|||
Let us now extract the data. Below we read the legacy-system files (one per country) that hold all data related to `Contacts`. |
|||
|
|||
```{r readlegacyfilepath, echo=TRUE, message=FALSE, warning=FALSE} |
|||
|
|||
oldfilepath <- list.files("./contacts/raw-data/", pattern="\\.xlsx$", full.names = TRUE) |
|||
print(oldfilepath) |
|||
``` |
|||
|
|||
Check manually that the list matches the actual files. |
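Part of this check could be scripted. The sketch below compares an expected set of country codes with the files that were found. The country codes, the paths, and the idea that each file is named exactly after its country are assumptions for illustration.

```r
# Hypothetical list of countries that should each have one legacy file
expected <- c("CH", "DE", "FR")

# Stand-in for the result of list.files(); paths are invented
found_paths <- c("./contacts/raw-data/CH.xlsx",
                 "./contacts/raw-data/DE.xlsx",
                 "./contacts/raw-data/IT.xlsx")

# Strip directory and extension to get the country part of each file name
found <- tools::file_path_sans_ext(basename(found_paths))

missing_files    <- setdiff(expected, found)   # expected but not present
unexpected_files <- setdiff(found, expected)   # present but not expected

print(missing_files)
print(unexpected_files)
```

Both vectors being empty would mean the folder contents match the expectation.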
|||
|
|||
```{r readlegacyfiles, echo=TRUE} |
|||
|
|||
old_files<-NULL |
|||
|
|||
#read_excel(path = oldfilepath[[i]], sheet = 1) |
|||
for(i in seq_along(oldfilepath)){ |
|||
old_files[[i]]<-read_excel(path = oldfilepath[[i]], sheet = 1) |
|||
} |
|||
|
|||
names(old_files)<-gsub("./contacts/raw-data/","",oldfilepath) |
|||
``` |
|||
|
|||
|
|||
*Some errors were noticed in the legacy file: columns with similar or identical names exist.* |
|||
|
|||
|
|||
|
|||
```{r readSAPtemplate, echo=TRUE, message=FALSE, warning=FALSE} |
|||
saptemplate<-read_excel("./contacts/template.xlsx", sheet = "Field_Definitions") |
|||
# First few rows of the imported data |
|||
head(saptemplate) |
|||
``` |
|||
|
|||
|
|||
*Please note that the format of the tables (sheets) has been slightly changed. Earlier, the corresponding sheet name was given in a row before the actual table. Now every row mentions the corresponding sheet name. This was done manually for convenience of data extraction.* |
|||
|
|||
|
|||
|
|||
```{r createmptySAPfiles, echo=TRUE, message=FALSE, warning=FALSE} |
|||
#orilo<-"en_US.UTF-8" |
|||
#Sys.setlocale(locale="en_US.UTF-8") |
|||
strt<-Sys.time() |
|||
snames <- unique(saptemplate$`Sheet Name`) |
|||
|
|||
for (h in seq_along(old_files)) { |
|||
|
|||
# Copy original data |
|||
old.copy <- old_files[[h]] |
|||
print(paste0(names(old_files[h])," imported")) |
|||
err.summ<-data.frame(Country=NULL, Name=NULL, Expected=NULL, Actual=NULL) #Error Cal |
|||
# Creates data frame for each sheet in snames |
|||
for (i in seq_along(snames)) { |
|||
print(paste0("Processing ..",snames[i])) |
|||
|
|||
# Select the column names from the field description sheet |
|||
print("Creating template") |
|||
sel.template.desc <- |
|||
saptemplate[saptemplate$`Sheet Name` == snames[i], ] |
|||
print("Creating column names") |
|||
sel.template.desc.colnames <- sel.template.desc$Header |
|||
|
|||
# Create a list by adding values from corresponding legacy data |
|||
temp <- NULL |
|||
print("adding values to template ") |
|||
for (j in seq_along(sel.template.desc.colnames)) { |
|||
temp[j] <-ifelse(sel.template.desc$oldkey[j]=="NA" | is.na(sel.template.desc$oldkey[j]), |
|||
NA,as.vector(old.copy[, sel.template.desc$oldkey[j]]) |
|||
) |
|||
|
|||
} |
|||
|
|||
# Rename the columns according to field description |
|||
print("renaming template ") |
|||
names(temp) <- sel.template.desc.colnames |
|||
|
|||
# Create data frame from the list |
|||
df <- as.data.frame(temp) |
|||
print("Converted to data frame") |
|||
|
|||
# Error summary file |
|||
Expected<-nrow(df) |
|||
|
|||
#Select essential rows |
|||
print("Identifying essential rows") |
|||
sel.template.desc |> |
|||
filter(Mandatory == "Yes") |> |
|||
pull(Header) -> essential.columns |
|||
|
|||
error.mandatory <- NULL |
|||
error.df<-data.frame(Country=NULL, Name=NULL, Rows=NULL, Expected=NULL) |
|||
# Operate on essential columns including creation of error file |
|||
for (k in seq_along(essential.columns)) { |
|||
|
|||
if(essential.columns[k]=="International_Version"){ |
|||
print("Found International Version. Adding 0.") |
|||
#stop() |
|||
df$International_Version<-"0" |
|||
} |
|||
|
|||
print("Creating and writing data with missing mandatory values") |
|||
assign( |
|||
paste0( |
|||
"error_mandatory_", |
|||
substr(names(old_files[h]), 2, 3), |
|||
"_", |
|||
snames[i], |
|||
"_", |
|||
essential.columns[k] |
|||
), |
|||
df[is.na(df[, essential.columns[k]]), ] |
|||
) |
|||
# TO be saved in error files |
|||
|
|||
if(nrow(df[is.na(df[, essential.columns[k]]), ])>0){ |
|||
write.csv( |
|||
df[is.na(df[, essential.columns[k]]), ], |
|||
paste0( |
|||
"./contacts/errors/mandatory/", |
|||
substr(names(old_files[h]), 2, 3), |
|||
"_", |
|||
snames[i], |
|||
"_", |
|||
essential.columns[k], |
|||
"_error_mandatory.csv" |
|||
), row.names = F, na="" |
|||
) |
|||
} |
|||
# Error summary file |
|||
Country<-substr(names(old_files[h]), 2, 3) |
|||
Name<-snames[i] |
|||
err.type<-paste0("Missing ",essential.columns[k]) |
|||
err.count<-nrow(df[is.na(df[, essential.columns[k]]), ]) |
|||
|
|||
|
|||
print("Removing rows with empty essential columns") |
|||
df <- df[!is.na(df[, essential.columns[k]]), ] |
|||
if(err.count>0){ |
|||
error.df<-rbind(error.df,data.frame(Country=Country, Name=Name, err.type=err.type, err.count=err.count)) #Error cal |
|||
} |
|||
} |
|||
|
|||
print("Identifying columns associated with codelists") |
|||
# List of columns that have a codelist |
|||
codelistcols <- sel.template.desc |> |
|||
filter(!is.na(`CodeList File Path`)) |> pull(Header) |
|||
for (k in seq_along(codelistcols)) { |
|||
if(codelistcols[k]=="International_Version"){ |
|||
print("Found International Version. Adding 0.") |
|||
df$International_Version<-"0" |
|||
} |
|||
|
|||
print(paste0("Identifying errors ",codelistcols[k])) |
|||
def.rows <- |
|||
which(!df[, codelistcols[k]] %in% c(pull(codelist_files[codelistcols[k]][[1]], Description), NA)) |
|||
def.n<- df[def.rows, 1] |
|||
def.rows.val <- |
|||
df[!df[, codelistcols[k]] %in% c(pull(codelist_files[codelistcols[k]][[1]], Description), NA), codelistcols[k]] |
|||
def <- data.frame(def.rows, def.n,def.rows.val) |
|||
if(nrow(def)>0){ |
|||
assign(paste0( |
|||
"error_codematch_", |
|||
substr(names(old_files[h]), 2, 3), |
|||
"_", |
|||
snames[i], |
|||
"_", |
|||
codelistcols[k] |
|||
), |
|||
def) # TO be saved |
|||
write.csv( |
|||
def, |
|||
paste0( |
|||
"./contacts/errors/codelist/", |
|||
substr(names(old_files[h]), 2, 3), |
|||
"_", |
|||
snames[i], |
|||
"_", |
|||
codelistcols[k], |
|||
"_error_codematch_.csv" |
|||
), row.names = F, na="" |
|||
) |
|||
} |
|||
err.type<-paste0("Codelist Mismatch ", codelistcols[k]) #Error cal |
|||
err.count<-nrow(def) #Error cal |
|||
if(err.count>0){ |
|||
error.df<-rbind(error.df,data.frame(Country=Country, Name=Name, err.type=err.type, err.count=err.count)) #Error cal |
|||
} |
|||
|
|||
|
|||
print(paste0("Removing errors ",codelistcols[k])) |
|||
# Removes any mismatch |
|||
df[!df[, codelistcols[k]] %in% c(pull(codelist_files[codelistcols[k]][[1]], Description), NA), codelistcols[k]] <- |
|||
NA |
|||
|
|||
# Matches each column with the corresponding code list and returns the value |
|||
df[, codelistcols[k]] <- |
|||
as.character(pull(codelist_files[codelistcols[k]][[1]], 2)[match(pull(df, codelistcols[k]), |
|||
pull(codelist_files[codelistcols[k]][[1]], Description))]) |
|||
|
|||
} |
|||
max.length <- as.numeric(sel.template.desc$`Max Length`) |
|||
dtype <- sel.template.desc$`Data Type` |
|||
rowval <- NULL |
|||
ival <- NULL |
|||
rval <- NULL |
|||
lenght.issue.df <- NULL |
|||
# Changing the data class |
|||
for (k in 1:ncol(df)) { |
|||
|
|||
if (dtype[k] == "String") { |
|||
df[, k] <- as.character(pull(df, k)) |
|||
} |
|||
if (dtype[k] == "Boolean") { |
|||
df[, k] <- as.logical(pull(df, k)) |
|||
} |
|||
if (dtype[k] == "DateTime") { |
|||
df[, k] <- lubridate::ymd_hms(pull(df, k)) |
|||
} |
|||
if (dtype[k] == "Time") { |
|||
df[, k] <- lubridate::hms(pull(df, k)) |
|||
|
|||
} # This list will increase and also change based on input date and time formats |
|||
|
|||
|
|||
|
|||
} |
|||
|
|||
print("Rectifying streetname") |
|||
# Street and House Number |
|||
if (any(colnames(df) == "Street")) { |
|||
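# Separate street name and house number. The result must be assigned back, |
|||
# otherwise the new columns stay empty. Rows without a digit yield NA in both. |
|||
df <- extract(df, |
|||
"Street", |
|||
c("Streetname", "HouseNumber"), |
|||
"(\\D+)(\\d.*)", |
|||
remove = FALSE) # keep Street so that the select() below can drop it |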
|||
df <- df |> |
|||
select(-c("Street", "House_Number")) |> |
|||
rename(Street = Streetname, House_Number = HouseNumber) |> |
|||
select(all_of(sel.template.desc.colnames)) |
|||
} |
|||
|
|||
# Length Rectification |
|||
colclasses <- lapply(df, class) |
|||
print("Rectifying Length") |
|||
for (k in 1:ncol(df)) { |
|||
if (colclasses[[k]] == "character") { |
|||
print("found character column ") |
|||
rowval <- pull(df, 1) |
|||
ival <- ifelse(nchar(pull(df, k))== 0 | is.na(nchar(pull(df, k))),1,nchar(pull(df, k))) |
|||
rval <- max.length[k] |
|||
# rectifying data length |
|||
df[, k] <- |
|||
ifelse(nchar(pull(df, k)) > max.length[k], |
|||
substring(pull(df, k), 1, max.length[k]), |
|||
pull(df, k)) |
|||
# Record length issues only for character columns, so that values left over |
|||
# from a previous column are not counted again |
|||
lenght.issue.df <- |
|||
rbind(lenght.issue.df, data.frame(rowval, ival, rval)) |
|||
err.type<- paste0("Length error ", colnames(df)[k]) # Error cal |
|||
err.count<- sum(ival>rval, na.rm = T) # Error cal |
|||
if(err.count>0){ |
|||
error.df<-rbind(error.df,data.frame(Country=Country, Name=Name, err.type=err.type, err.count=err.count)) #Error cal |
|||
} |
|||
} |
|||
} |
|||
|
|||
lenght.issue.df <- dplyr::filter(lenght.issue.df,ival>rval) |
|||
|
|||
|
|||
if(nrow(lenght.issue.df)>0){ |
|||
write.csv(lenght.issue.df, |
|||
paste0( |
|||
"./contacts/errors/length/", |
|||
substr(names(old_files[h]), 2, 3), |
|||
"_", |
|||
snames[i], |
|||
"_length_error.csv" |
|||
), row.names = F, na="") |
|||
} |
|||
|
|||
assign(snames[i], df) |
|||
write.csv(df,paste0("./contacts/output/", substr(names(old_files[h]), 2, 3), "_", snames[i],".csv"), row.names = F, na="") |
|||
if(nrow(error.df)>0){ |
|||
write.csv(error.df, paste0("./contacts/summary/",substr(names(old_files[h]), 2, 3), "_", snames[i],"_error",".csv"), row.names = F, na="") # Error write |
|||
} |
|||
|
|||
err.summ<-rbind(err.summ,data.frame(Country=Country, Name=Name, Expected=Expected, Actual=nrow(df))) #Error Cal |
|||
|
|||
} |
|||
write.csv(err.summ, |
|||
paste0("./contacts/summary/" ,substr(names(old_files[h]), 2, 3), "_", snames[i],"_sumerror",".csv"), row.names = F, na="") # Error Write |
|||
} |
|||
|
|||
end<-Sys.time() |
|||
|
|||
end-strt |
|||
|
|||
|
|||
``` |
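The length rectification in the chunk above truncates character values that exceed the template's maximum length and counts how many were affected. A minimal sketch of that step on invented values, with a made-up limit of 10 characters:

```r
# Invented column values, including a missing and an empty entry
values  <- c("Short", "A much longer street name", NA, "")
max_len <- 10  # stand-in for one entry of `Max Length`

# Flag values that are present and longer than the limit
too_long <- !is.na(values) & nchar(values) > max_len

# Truncate only the flagged values; everything else is left untouched
truncated <- ifelse(too_long, substring(values, 1, max_len), values)

print(sum(too_long))
print(truncated)
```

Handling `NA` explicitly in the flag avoids `NA` conditions inside `ifelse()`, so missing values pass through unchanged.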
|||
*The code failed because the Department column appears several times in the data, and on import R renamed the duplicates (to Department..xx).* |
|||
*Manually verify if these are the required templates* |
|||
@ -0,0 +1,13 @@ |
|||
Version: 1.0 |
|||
|
|||
RestoreWorkspace: Default |
|||
SaveWorkspace: Default |
|||
AlwaysSaveHistory: Default |
|||
|
|||
EnableCodeIndexing: Yes |
|||
UseSpacesForTab: Yes |
|||
NumSpacesForTab: 2 |
|||
Encoding: UTF-8 |
|||
|
|||
RnwWeave: Sweave |
|||
LaTeX: pdfLaTeX |
|||
Binary file not shown.
@ -0,0 +1,373 @@ |
|||
--- |
|||
title: "Projects" |
|||
author: "Scary Scarecrow" |
|||
date: "1/12/2022" |
|||
output: html_document |
|||
--- |
|||
|
|||
```{r setup, include=FALSE} |
|||
knitr::opts_chunk$set(echo = TRUE) |
|||
library(readxl) |
|||
library(dplyr) |
|||
library(lubridate) |
|||
library(DT) |
|||
library(tidyr) |
|||
|
|||
mutlstxlrdr<-function(){ |
|||
for( i in seq_along(snames)){ |
|||
colnames<-unique(saptemplate[saptemplate$`Sheet Name`==snames[i],]$Header) |
|||
df<-read.table("", col.names = colnames) |
|||
assign(snames[i], df) |
|||
|
|||
} |
|||
} |
|||
``` |
|||
|
|||
|
|||
## Data transformation workflow |
|||
|
|||
Following is the proposed preliminary workflow for the data transformation project. |
|||
|
|||
|
|||
>All files of a segment (contacts, accounts etc.) should be inside the relevant folder. Each segment folder should contain one folder for all codelist files. All legacy data (one file per country, named after the country) should be inside the raw-data folder. A further file with the field definitions, including the name of the matching column in the legacy file, should also be there. |
|||
|
|||
>*Make sure that there are no hidden files inside the directory.* |
|||
|
|||
|
|||
### Code Lists |
|||
|
|||
|
|||
```{r Create List of Files, echo=TRUE, message=FALSE, warning=FALSE} |
|||
|
|||
filenames <- list.files("./projects/CodeList", pattern="\\.xlsx$", full.names = TRUE) # `pattern` is a regular expression, not a glob. A separate directory for code lists could be avoided, but organizing the files may then be difficult. This can be explored further if we want to transform all the data in one go, i.e. not by function (contacts, accounts etc.). |
|||
|
|||
# File paths |
|||
print(filenames) |
|||
``` |
|||
|
|||
|
|||
Check manually that the above list includes all the codelist files. |
|||
If correct, then read the files. |
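One reason the file list can contain surprises is that the `pattern` argument of `list.files()` is a regular expression, not a shell glob. The sketch below, on invented file names, shows an anchored regular expression and the `utils::glob2rx()` helper that converts a glob into one.

```r
# Invented file names
files <- c("CH.xlsx", "DE.xls", "notes_xlsx.txt")

# Anchored regular expression: a literal dot, then "xlsx" at the end of the name
xlsx_only <- grepl("\\.xlsx$", files)

# Accept both the old and the new Excel extension
xls_or_xlsx <- grepl("\\.xlsx?$", files)

# A glob can be converted to an equivalent regular expression
glob_regex <- utils::glob2rx("*.xlsx")
via_glob <- grepl(glob_regex, files)

print(files[xlsx_only])
print(files[xls_or_xlsx])
```

Either form can be passed as `pattern`; the anchored version avoids picking up names that merely contain the extension text somewhere in the middle.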
|||
|
|||
```{r codelistreader, echo=TRUE, message=FALSE, warning=FALSE} |
|||
sheet_names<-lapply(filenames, excel_sheets) # Creates a list of the sheet names |
|||
codelist_files<-NULL |
|||
for(i in seq_along(filenames)){ |
|||
a<-lapply(excel_sheets(filenames[[i]]), read_excel, path = filenames[[i]], col_types = "text") # Reads the sheets of the excel files |
|||
names(a)<-c(sheet_names[[i]]) # Renames them according to the sheet names extracted above |
|||
codelist_files<-c(codelist_files,a) |
|||
} |
|||
# Names of the files imported |
|||
names(codelist_files) |
|||
#codelist_files<-unique(codelist_files) |
|||
codelist_files$Academic_Title |
|||
``` |
|||
|
|||
|
|||
|
|||
### Templates |
|||
|
|||
|
|||
Let us now extract the data. Below we read the legacy-system files (one per country) that hold all data related to `Projects`. |
|||
|
|||
```{r readlegacyfilepath, echo=TRUE, message=FALSE, warning=FALSE} |
|||
oldfilepath<-list.files("./projects/raw-data", pattern="\\.xlsx?$", full.names = TRUE) # Change the path; the pattern is a regular expression matching .xls and .xlsx |
|||
print(oldfilepath) |
|||
``` |
|||
|
|||
Check manually that the list matches the actual files. |
|||
|
|||
```{r readlegacyfiles, echo=TRUE} |
|||
|
|||
old_files<-NULL |
|||
|
|||
#read_excel(path = oldfilepath[[i]], sheet = 1) |
|||
for(i in seq_along(oldfilepath)){ |
|||
old_files[[i]]<-read_excel(path = oldfilepath[[i]], sheet = 1) |
|||
} |
|||
|
|||
names(old_files)<-gsub("./projects/raw-data/","",oldfilepath) |
|||
``` |
|||
|
|||
|
|||
*Some errors were noticed in the legacy file: columns with similar or identical names exist.* |
|||
|
|||
|
|||
|
|||
```{r readSAPtemplate, echo=TRUE, message=FALSE, warning=FALSE} |
|||
saptemplate<-read_excel("./projects/template.xlsx", sheet = "Field_Definitions") |
|||
# First few rows of the imported data |
|||
head(saptemplate) |
|||
``` |
|||
|
|||
|
|||
*Please note that the format of the tables (sheets) has been slightly changed. Earlier, the corresponding sheet name was given in a row before the actual table. Now every row mentions the corresponding sheet name. This was done manually for convenience of data extraction.* |
|||
|
|||
|
|||
|
|||
```{r createmptySAPfiles, echo=TRUE, message=FALSE, warning=FALSE} |
|||
#orilo<-"en_US.UTF-8" |
|||
#Sys.setlocale(locale="en_US.UTF-8") |
|||
strt<-Sys.time() |
|||
snames <- unique(saptemplate$`Sheet Name`) |
|||
|
|||
for (h in seq_along(old_files)) { |
|||
|
|||
# Copy original data |
|||
old.copy <- old_files[[h]] |
|||
print(paste0(names(old_files[h])," imported")) |
|||
err.summ<-data.frame(Country=NULL, Name=NULL, Expected=NULL, Actual=NULL) #Error Cal |
|||
# Creates data frame for each sheet in snames |
|||
for (i in seq_along(snames)) { |
|||
print(paste0("Processing ..",snames[i])) |
|||
|
|||
# Select the column names from the field description sheet |
|||
print("Creating template") |
|||
sel.template.desc <- |
|||
saptemplate[saptemplate$`Sheet Name` == snames[i], ] |
|||
print("Creating column names") |
|||
sel.template.desc.colnames <- sel.template.desc$Header |
|||
|
|||
# Create a list by adding values from corresponding legacy data |
|||
temp <- list() |
|||
print("adding values to template ") |
|||
for (j in seq_along(sel.template.desc.colnames)) { |
|||
# Use the mapped legacy column, or a single NA when no legacy column is mapped |
|||
oldkey <- sel.template.desc$oldkey[j] |
|||
temp[[j]] <- if (is.na(oldkey) || oldkey == "NA") NA else old.copy[, oldkey, drop = TRUE] |
|||
|
|||
} |
|||
|
|||
# Rename the columns according to field description |
|||
print("renaming template ") |
|||
names(temp) <- sel.template.desc.colnames |
|||
|
|||
# Create data frame from the list |
|||
df <- as.data.frame(temp) |
|||
print("Converted to data frame") |
|||
|
|||
# Error summary file |
|||
Expected<-nrow(df) |
|||
|
|||
#Select essential rows |
|||
print("Identifying essential rows") |
|||
sel.template.desc |> |
|||
filter(Mandatory == "Yes") |> |
|||
pull(Header) -> essential.columns |
|||
|
|||
error.mandatory <- NULL |
|||
error.df<-data.frame(Country=NULL, Name=NULL, Rows=NULL, Expected=NULL) |
|||
# Operate on essential columns including creation of error file |
|||
for (k in seq_along(essential.columns)) { |
|||
|
|||
if(essential.columns[k]=="Currency"){ |
|||
print("Found Currency. Setting CHF.") |
|||
#stop() |
|||
# Default the mandatory Currency column to CHF |
|||
df$Currency<-"CHF" |
|||
} |
|||
|
|||
print("Creating and writing data with missing mandatory values") |
|||
assign( |
|||
paste0( |
|||
"error_mandatory_", |
|||
substr(names(old_files[h]), 1, 2), |
|||
"_", |
|||
snames[i], |
|||
"_", |
|||
essential.columns[k] |
|||
), |
|||
df[is.na(df[, essential.columns[k]]), ] |
|||
) |
|||
# TO be saved in error files |
|||
|
|||
if(nrow(df[is.na(df[, essential.columns[k]]), ])>0){ |
|||
write.csv( |
|||
df[is.na(df[, essential.columns[k]]), ], |
|||
paste0( |
|||
"./projects/errors/mandatory/", |
|||
substr(names(old_files[h]), 1, 2), |
|||
"_", |
|||
snames[i], |
|||
"_", |
|||
essential.columns[k], |
|||
"_error_mandatory.csv" |
|||
), row.names = F, na="" |
|||
) |
|||
} |
|||
# Error summary file |
|||
Country<-substr(names(old_files[h]), 1, 2) |
|||
Name<-snames[i] |
|||
err.type<-paste0("Missing ",essential.columns[k]) |
|||
err.count<-nrow(df[is.na(df[, essential.columns[k]]), ]) |
|||
|
|||
|
|||
print("Removing rows with empty essential columns") |
|||
df <- df[!is.na(df[, essential.columns[k]]), ] |
|||
if(err.count>0){ |
|||
error.df<-rbind(error.df,data.frame(Country=Country, Name=Name, err.type=err.type, err.count=err.count)) #Error cal |
|||
} |
|||
} |
|||
|
|||
print("Identifying columns associated with codelists") |
|||
# List of columns that have a codelist |
|||
codelistcols <- sel.template.desc |> |
|||
filter(!is.na(`CodeList File Path`)) |> pull(Header) |
|||
for (k in seq_along(codelistcols)) { |
|||
if(codelistcols[k]=="Currency"){ |
|||
print("Found Currency. Setting CHF.") |
|||
# Default the Currency column to CHF before the code list lookup |
|||
df$Currency<-"CHF" |
|||
} |
|||
|
|||
print(paste0("Identifying errors ",codelistcols[k])) |
|||
def.rows <- |
|||
which(!df[, codelistcols[k]] %in% c(pull(codelist_files[codelistcols[k]][[1]], Description), NA)) |
|||
def.n<- df[def.rows, 1] |
|||
def.rows.val <- |
|||
df[!df[, codelistcols[k]] %in% c(pull(codelist_files[codelistcols[k]][[1]], Description), NA), codelistcols[k]] |
|||
def <- data.frame(def.rows, def.n,def.rows.val) |
|||
if(nrow(def)>0){ |
|||
assign(paste0( |
|||
"error_codematch_", |
|||
substr(names(old_files[h]), 1, 2), |
|||
"_", |
|||
snames[i], |
|||
"_", |
|||
codelistcols[k] |
|||
), |
|||
def) # TO be saved |
|||
write.csv( |
|||
def, |
|||
paste0( |
|||
"./projects/errors/codelist/", |
|||
substr(names(old_files[h]), 1, 2), |
|||
"_", |
|||
snames[i], |
|||
"_", |
|||
codelistcols[k], |
|||
"_error_codematch_.csv" |
|||
), row.names = F, na="" |
|||
) |
|||
} |
|||
err.type<-paste0("Codelist Mismatch ", codelistcols[k]) #Error cal |
|||
err.count<-nrow(def) #Error cal |
|||
if(err.count>0){ |
|||
error.df<-rbind(error.df,data.frame(Country=Country, Name=Name, err.type=err.type, err.count=err.count)) #Error cal |
|||
} |
|||
|
|||
|
|||
print(paste0("Removing errors ",codelistcols[k])) |
|||
# Removes any mismatch |
|||
df[!df[, codelistcols[k]] %in% c(pull(codelist_files[codelistcols[k]][[1]], Description), NA), codelistcols[k]] <- |
|||
NA |
|||
|
|||
# Matches each column with the corresponding code list and returns the value |
|||
df[, codelistcols[k]] <- |
|||
pull(codelist_files[codelistcols[k]][[1]], 2)[match(pull(df, codelistcols[k]), |
|||
pull(codelist_files[codelistcols[k]][[1]], Description))] |
|||
|
|||
} |
|||
max.length <- as.numeric(sel.template.desc$`Max Length`) |
|||
dtype <- sel.template.desc$`Data Type` |
|||
rowval <- NULL |
|||
ival <- NULL |
|||
rval <- NULL |
|||
lenght.issue.df <- NULL |
|||
# Changing the data class |
|||
for (k in 1:ncol(df)) { |
|||
|
|||
if (dtype[k] == "String") { |
|||
df[, k] <- as.character(pull(df, k)) |
|||
} |
|||
if (dtype[k] == "Boolean") { |
|||
df[, k] <- as.logical(pull(df, k)) |
|||
} |
|||
if (dtype[k] == "DateTime") { |
|||
df[, k] <- lubridate::ymd_hms(pull(df, k)) |
|||
} |
|||
if (dtype[k] == "Time") { |
|||
df[, k] <- lubridate::hms(pull(df, k)) |
|||
|
|||
} # This list will increase and also change based on input date and time formats |
|||
|
|||
|
|||
|
|||
} |
|||
|
|||
print("Rectifying streetname") |
|||
# Street and House Number |
|||
if (any(colnames(df) == "Street")) { |
|||
# Split "Street" into street name and house number and keep the result. |
|||
# Rows without a digit yield NA for both parts. |
|||
df <- tidyr::extract(df, |
|||
"Street", |
|||
c("Streetname", "HouseNumber"), |
|||
"(\\D+)(\\d.*)", |
|||
remove = FALSE) |
|||
df <- df |> |
|||
select(-c("Street", "House_Number")) |> |
|||
rename(Street = Streetname, House_Number = HouseNumber) |> |
|||
select(all_of(sel.template.desc.colnames)) |
|||
} |
|||
|
|||
# Length Rectification |
|||
colclasses <- lapply(df, class) |
|||
print("Rectifying Length") |
|||
for (k in 1:ncol(df)) { |
|||
if (colclasses[[k]] == "character") { |
|||
print("found character column ") |
|||
rowval <- pull(df, 1) |
|||
ival <- ifelse(nchar(pull(df, k))== 0 | is.na(nchar(pull(df, k))),1,nchar(pull(df, k))) |
|||
rval <- max.length[k] |
|||
# rectifying data length |
|||
df[, k] <- |
|||
ifelse(nchar(pull(df, k)) > max.length[k], |
|||
substring(pull(df, k), 1, max.length[k]), |
|||
pull(df, k)) |
|||
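} else { |
|||
# Not a character column: clear the values left over from the previous column |
|||
rowval <- NULL; ival <- NULL; rval <- NULL |
|||
} |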
|||
|
|||
lenght.issue.df <- |
|||
rbind(lenght.issue.df, data.frame(rowval, ival, rval)) |
|||
|
|||
err.type<- paste0("Length error ", colnames(df)[k]) # Error cal |
|||
err.count<- sum(ival>rval, na.rm = T) # Error cal |
|||
if(err.count>0){ |
|||
error.df<-rbind(error.df,data.frame(Country=Country, Name=Name, err.type=err.type, err.count=err.count)) #Error cal |
|||
} |
|||
|
|||
|
|||
} |
|||
|
|||
lenght.issue.df <- dplyr::filter(lenght.issue.df,ival>rval) |
|||
|
|||
|
|||
if(nrow(lenght.issue.df)>0){ |
|||
write.csv(lenght.issue.df, |
|||
paste0( |
|||
"./projects/errors/length/", |
|||
substr(names(old_files[h]), 1, 2), |
|||
"_", |
|||
snames[i], |
|||
"_length_error.csv" |
|||
), row.names = F, na="") |
|||
} |
|||
|
|||
assign(snames[i], df) |
|||
write.csv(df,paste0("./projects/output/", substr(names(old_files[h]), 1, 2), "_", snames[i],".csv"), row.names = F, na="") |
|||
if(nrow(error.df)>0){ |
|||
write.csv(error.df, paste0("./projects/summary/",substr(names(old_files[h]), 1, 2), "_", snames[i],"_error",".csv"), row.names = F, na="") # Error write |
|||
} |
|||
err.summ<-rbind(err.summ,data.frame(Country=Country, Name=Name, Expected=Expected, Actual=nrow(df))) #Error Cal |
|||
|
|||
} |
|||
write.csv(err.summ, |
|||
paste0("./projects/summary/" ,substr(names(old_files[h]), 1, 2), "_", snames[i],"_sumerror",".csv"), row.names = F, na="") # Error Write |
|||
} |
|||
|
|||
end<-Sys.time() |
|||
|
|||
end-strt |
|||
|
|||
|
|||
``` |
|||
*The code failed because the Department column appears several times in the data, and on import R renamed the duplicates to Department..xx.* |
|||
*Manually verify if these are the required templates* |
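The code list step in the chunk above boils down to a lookup with `match()`: values not found in the code list are reported as mismatches and end up as `NA`, the rest are replaced by their code. A minimal sketch in base R, assuming a made-up code list whose second column holds the code (the status values and codes are hypothetical):

```r
# Hypothetical code list: column 1 holds the description, column 2 the code
codelist <- data.frame(
  Description = c("Open", "Closed", "On hold"),
  Code        = c("01", "02", "03")
)

legacy_values <- c("Closed", "Open", "Cancelled", NA)

# Values that are neither NA nor listed in the code list are mismatches
is_mismatch <- !(legacy_values %in% c(codelist$Description, NA))
mismatches  <- legacy_values[is_mismatch]

# match() gives the row of each description; unmatched values become NA
codes <- codelist$Code[match(legacy_values, codelist$Description)]

print(mismatches)
print(codes)
```

Missing legacy values are not counted as mismatches here; they simply stay `NA` after the lookup, which mirrors the `%in% c(..., NA)` test used in the workflow.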
|||
|
|||
@ -0,0 +1,398 @@ |
|||
--- |
|||
title: "Report" |
|||
author: "Data Science Team, LaNubia" |
|||
date: "1/11/2022" |
|||
output: |
|||
html_document: |
|||
theme: lumen |
|||
highlight: tango |
|||
self_contained: true |
|||
|
|||
--- |
|||
|
|||
```{r setup, include=FALSE} |
|||
knitr::opts_chunk$set(echo = FALSE, error=TRUE, message=FALSE, warning=FALSE) |
|||
library(readxl) |
|||
library(DT) |
|||
library(tidyr) |
|||
library(dplyr) |
|||
|
|||
rxl<- function(path,...){ |
|||
tryCatch(read_excel(path,...), error= function(c){ |
|||
c$message<-"No Data" |
|||
print("No Data") |
|||
stop(c) |
|||
}) |
|||
} |
|||
|
|||
ltodf<- function(path,...){ |
|||
tryCatch(rbind.data.frame(path,...), error= function(c){ |
|||
c$message<-"No Data" |
|||
print("No Data") |
|||
stop(c) |
|||
}) |
|||
} |
|||
|
|||
``` |
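The two helpers above replace the error message and then re-raise the error, so the report relies on the `error=TRUE` chunk option to keep knitting. An alternative, sketched here under the assumption that an empty table is an acceptable placeholder, is to return an empty data frame instead (`safe_read` is a hypothetical helper, not part of the report):

```r
# Run a reader function and fall back to an empty data frame if it fails
safe_read <- function(reader, ...) {
  tryCatch(
    reader(...),
    error = function(e) {
      message("No Data: ", conditionMessage(e))
      data.frame()
    }
  )
}

# Reading a file that does not exist yields a 0-row data frame instead of an error
missing_csv <- safe_read(read.csv, file.path(tempdir(), "does-not-exist.csv"))
nrow(missing_csv)
```

The same wrapper could be used with `read_excel` in place of `read.csv`.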
|||
|
|||
## Status Report |
|||
|
|||
|
|||
### Input Available |
|||
|
|||
|
|||
|
|||
```{r echo=FALSE, message=FALSE, warning=FALSE} |
|||
contactinputpath<-list.files("./contacts/raw-data", pattern="*.xlsx", full.names = T) |
|||
accountinputpath<-list.files("./accounts/raw-data", pattern="*.xls", full.names = T) |
|||
projectinputpath<-list.files("./projects/raw-data", pattern="*.xls", full.names = T) |
|||
supportinputpath<-list.files("./support/raw-data", pattern="*.xls", full.names = T) |
|||
|
|||
|
|||
conta<-lapply(contactinputpath, read_excel) |
|||
names(conta)<-gsub("./contacts/raw-data/","",contactinputpath) |
|||
c<-lapply(conta, nrow) |
|||
|
|||
Input_data<-"Contact" |
|||
#Country<-gsub(".xlsx","",names(conta)) |
|||
Observations<-c |
|||
|
|||
temp<-data.frame(Input_data,Observations) |> |
|||
pivot_longer(cols = (-1), names_to = "Country", values_to = "Observations") |> |
|||
mutate(Country=gsub("\\.xlsx?$","",Country)) |
|||
|
|||
input.summary<-temp |
|||
|
|||
acco<-lapply(accountinputpath, read_excel) |
|||
names(acco)<-gsub("./accounts/raw-data/","",accountinputpath) |
|||
a<-lapply(acco, nrow) |
|||
|
|||
Input_data<-"Accounts" |
|||
#Country<-gsub(".xlsx","",names(conta)) |
|||
Observations<-a |
|||
|
|||
temp<-data.frame(Input_data,Observations) |> |
|||
pivot_longer(cols = (-1), names_to = "Country", values_to = "Observations") |> |
|||
mutate(Country=gsub("\\.xlsx?$","",Country)) |
|||
|
|||
input.summary<-rbind(input.summary,temp) |
|||
|
|||
proja<-lapply(projectinputpath, read_excel) |
|||
names(proja)<-gsub("./projects/raw-data/","",projectinputpath) |
|||
p<-lapply(proja, nrow) |
|||
|
|||
Input_data<-"Projects" |
|||
#Country<-gsub(".xlsx","",names(conta)) |
|||
Observations<-p |
|||
|
|||
temp<-data.frame(Input_data,Observations) |> |
|||
pivot_longer(cols = (-1), names_to = "Country", values_to = "Observations") |> |
|||
mutate(Country=gsub("\\.xlsx?$","",Country)) |
|||
|
|||
input.summary<-rbind(input.summary,temp) |
|||
|
|||
suppo<-lapply(supportinputpath, read_excel) |
|||
names(suppo)<-gsub("./support/raw-data/","",supportinputpath) |
|||
s<-lapply(suppo, nrow) |
|||
|
|||
Input_data<-"Support" |
|||
#Country<-gsub(".xlsx","",names(conta)) |
|||
Observations<-s |
|||
|
|||
temp<-data.frame(Input_data,Observations) |> |
|||
pivot_longer(cols = (-1), names_to = "Country", values_to = "Observations") |> |
|||
mutate(Country=gsub("\\.xlsx?$","",Country)) |
|||
|
|||
input.summary<-rbind(input.summary,temp) |
|||
|
|||
datatable(input.summary, extensions = "Buttons", |
|||
options = list(paging = TRUE, |
|||
scrollX=TRUE, |
|||
searching = TRUE, |
|||
ordering = TRUE, |
|||
dom = 'Bfrtip', |
|||
buttons = c('copy', 'csv', 'excel', 'pdf'), |
|||
pageLength=5, |
|||
lengthMenu=c(3,5,10) )) |
|||
|
|||
``` |
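The chunk above repeats the same counting steps for each of the four segments. A minimal sketch of how they could be folded into one helper, in base R; `summarise_inputs` is a hypothetical name and the small lists stand in for the results of `lapply(paths, read_excel)`:

```r
# Build a long summary (one row per file) from a named list of data frames
summarise_inputs <- function(tables, label) {
  data.frame(
    Input_data   = label,
    Country      = tools::file_path_sans_ext(names(tables)),
    Observations = unname(vapply(tables, nrow, integer(1)))
  )
}

# Hypothetical stand-ins for the imported legacy files
contacts <- list(AT.xlsx = data.frame(id = 1:3), CH.xlsx = data.frame(id = 1:5))
accounts <- list(AT.xls = data.frame(id = 1:2))

input_summary <- rbind(
  summarise_inputs(contacts, "Contact"),
  summarise_inputs(accounts, "Accounts")
)
print(input_summary)
```

Using `tools::file_path_sans_ext()` removes the extension whether the file ends in `.xls` or `.xlsx`.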
|||
|
|||
|
|||
|
|||
|
|||
Simplified view |
|||
|
|||
|
|||
|
|||
```{r echo=FALSE} |
|||
input.summary |> |
|||
pivot_wider(names_from = Country, values_from = Observations) |> datatable(extensions = "Buttons", |
|||
options = list(paging = TRUE, |
|||
scrollX=TRUE, |
|||
searching = TRUE, |
|||
ordering = TRUE, |
|||
dom = 'Bfrtip', |
|||
buttons = c('copy', 'csv', 'excel', 'pdf'), |
|||
pageLength=5, |
|||
lengthMenu=c(3,5,10) )) |
|||
``` |
|||
|
|||
|
|||
|
|||
### Contacts |
|||
|
|||
|
|||
|
|||
#### Template |
|||
|
|||
|
|||
|
|||
SAP templates available: |
|||
|
|||
|
|||
|
|||
```{r echo=FALSE} |
|||
datatable(data.frame(Templates=unique(rxl("./contacts/template.xlsx", sheet = "Field_Definitions")[,1])), extensions = "Buttons", |
|||
options = list(paging = TRUE, |
|||
scrollX=TRUE, |
|||
searching = TRUE, |
|||
ordering = TRUE, |
|||
dom = 'Bfrtip', |
|||
buttons = c('copy', 'csv', 'excel', 'pdf'), |
|||
pageLength=5, |
|||
lengthMenu=c(3,5,10) )) |
|||
``` |
|||
|
|||
|
|||
|
|||
|
|||
#### Summary of Errors |
|||
|
|||
|
|||
|
|||
```{r echo=FALSE, message=FALSE, warning=FALSE} |
|||
sumerrfilepath<-list.files("./contacts/summary", pattern="*sumerror.csv", full.names = T) |
|||
errfilepath<-list.files("./contacts/summary", pattern="*_error.csv", full.names = T) |
|||
|
|||
sumerrfiles<-lapply(sumerrfilepath, read.csv) |
|||
datatable(do.call(ltodf, sumerrfiles), extensions = "Buttons", |
|||
options = list(paging = TRUE, |
|||
scrollX=TRUE, |
|||
searching = TRUE, |
|||
ordering = TRUE, |
|||
dom = 'Bfrtip', |
|||
buttons = c('copy', 'csv', 'excel', 'pdf'), |
|||
pageLength=5, |
|||
lengthMenu=c(3,5,10) )) |
|||
|
|||
``` |
|||
|
|||
|
|||
|
|||
#### Error by template |
|||
|
|||
|
|||
```{r echo=FALSE, message=FALSE, warning=FALSE} |
|||
errfiles<-lapply(errfilepath, read.csv) |
|||
datatable(do.call(ltodf, errfiles), extensions = "Buttons", |
|||
options = list(paging = TRUE, |
|||
scrollX=TRUE, |
|||
searching = TRUE, |
|||
ordering = TRUE, |
|||
dom = 'Bfrtip', |
|||
buttons = c('copy', 'csv', 'excel', 'pdf'), |
|||
pageLength=5, |
|||
lengthMenu=c(3,5,10) )) |
|||
``` |
|||
|
|||
|
|||
|
|||
### Accounts |
|||
|
|||
|
|||
#### Template |
|||
|
|||
SAP templates available: |
|||
|
|||
```{r echo=FALSE} |
|||
datatable(data.frame(Templates=unique(rxl("./accounts/template.xlsx", sheet = "Field_Definitions")[,1])), extensions = "Buttons", |
|||
options = list(paging = TRUE, |
|||
scrollX=TRUE, |
|||
searching = TRUE, |
|||
ordering = TRUE, |
|||
dom = 'Bfrtip', |
|||
buttons = c('copy', 'csv', 'excel', 'pdf'), |
|||
pageLength=5, |
|||
lengthMenu=c(3,5,10) )) |
|||
``` |
|||
|
|||
|
|||
|
|||
|
|||
#### Summary of Errors |
|||
|
|||
|
|||
|
|||
```{r echo=FALSE, message=FALSE, warning=FALSE} |
|||
sumerrfilepath<-list.files("./accounts/summary", pattern="*sumerror.csv", full.names = T) |
|||
errfilepath<-list.files("./accounts/summary", pattern="*_error.csv", full.names = T) |
|||
|
|||
sumerrfiles<-lapply(sumerrfilepath, read.csv) |
|||
datatable(do.call(ltodf, sumerrfiles), extensions = "Buttons", |
|||
options = list(paging = TRUE, |
|||
scrollX=TRUE, |
|||
searching = TRUE, |
|||
ordering = TRUE, |
|||
dom = 'Bfrtip', |
|||
buttons = c('copy', 'csv', 'excel', 'pdf'), |
|||
pageLength=5, |
|||
lengthMenu=c(3,5,10) )) |
|||
|
|||
``` |
|||
|
|||
|
|||
|
|||
#### Error by template |
|||
|
|||
|
|||
```{r echo=FALSE, message=FALSE, warning=FALSE} |
|||
errfiles<-lapply(errfilepath, read.csv) |
|||
datatable(do.call(ltodf, errfiles), extensions = "Buttons", |
|||
options = list(paging = TRUE, |
|||
scrollX=TRUE, |
|||
searching = TRUE, |
|||
ordering = TRUE, |
|||
dom = 'Bfrtip', |
|||
buttons = c('copy', 'csv', 'excel', 'pdf'), |
|||
pageLength=5, |
|||
lengthMenu=c(3,5,10) )) |
|||
``` |
|||
|
|||
|
|||
|
|||
|
|||
### Projects |
|||
|
|||
|
|||
#### Template |
|||
|
|||
SAP templates available: |
|||
|
|||
```{r echo=FALSE} |
|||
datatable(data.frame(Templates=unique(rxl("./projects/template.xlsx", sheet = "Field_Definitions")[,1])), extensions = "Buttons", |
|||
options = list(paging = TRUE, |
|||
scrollX=TRUE, |
|||
searching = TRUE, |
|||
ordering = TRUE, |
|||
dom = 'Bfrtip', |
|||
buttons = c('copy', 'csv', 'excel', 'pdf'), |
|||
pageLength=5, |
|||
lengthMenu=c(3,5,10) )) |
|||
``` |
|||
|
|||
|
|||
|
|||
|
|||
#### Summary of Errors |
|||
|
|||
|
|||
|
|||
```{r echo=FALSE, message=FALSE, warning=FALSE} |
|||
sumerrfilepath<-list.files("./projects/summary", pattern="*sumerror.csv", full.names = T) |
|||
errfilepath<-list.files("./projects/summary", pattern="*_error.csv", full.names = T) |
|||
|
|||
sumerrfiles<-lapply(sumerrfilepath, read.csv) |
|||
datatable(do.call(ltodf, sumerrfiles), extensions = "Buttons", |
|||
options = list(paging = TRUE, |
|||
scrollX=TRUE, |
|||
searching = TRUE, |
|||
ordering = TRUE, |
|||
dom = 'Bfrtip', |
|||
buttons = c('copy', 'csv', 'excel', 'pdf'), |
|||
pageLength=5, |
|||
lengthMenu=c(3,5,10) )) |
|||
|
|||
``` |
|||
|
|||
|
|||
|
|||
#### Error by template |
|||
|
|||
|
|||
```{r echo=FALSE, message=FALSE, warning=FALSE} |
|||
errfiles<-lapply(errfilepath, read.csv) |
|||
datatable(do.call(ltodf, errfiles), extensions = "Buttons", |
|||
options = list(paging = TRUE, |
|||
scrollX=TRUE, |
|||
searching = TRUE, |
|||
ordering = TRUE, |
|||
dom = 'Bfrtip', |
|||
buttons = c('copy', 'csv', 'excel', 'pdf'), |
|||
pageLength=5, |
|||
lengthMenu=c(3,5,10) )) |
|||
``` |
|||
|
|||
|
|||
|
|||
|
|||
### Support |
|||
|
|||
|
|||
#### Template |
|||
|
|||
SAP templates available: |
|||
|
|||
```{r echo=FALSE} |
|||
datatable(data.frame(Templates=unique(rxl("./support/template.xlsx", sheet = "Field_Definitions")[,1])), extensions = "Buttons", |
|||
options = list(paging = TRUE, |
|||
scrollX=TRUE, |
|||
searching = TRUE, |
|||
ordering = TRUE, |
|||
dom = 'Bfrtip', |
|||
buttons = c('copy', 'csv', 'excel', 'pdf'), |
|||
pageLength=5, |
|||
lengthMenu=c(3,5,10) )) |
|||
``` |
|||
|
|||
|
|||
|
|||
|
|||
#### Summary of Errors |
|||
|
|||
|
|||
|
|||
```{r echo=FALSE, message=FALSE, warning=FALSE} |
|||
sumerrfilepath<-list.files("./support/summary", pattern="*sumerror.csv", full.names = T) |
|||
errfilepath<-list.files("./support/summary", pattern="*_error.csv", full.names = T) |
|||
|
|||
sumerrfiles<-lapply(sumerrfilepath, read.csv) |
|||
datatable(do.call(ltodf, sumerrfiles), extensions = "Buttons", |
|||
options = list(paging = TRUE, |
|||
scrollX=TRUE, |
|||
searching = TRUE, |
|||
ordering = TRUE, |
|||
dom = 'Bfrtip', |
|||
buttons = c('copy', 'csv', 'excel', 'pdf'), |
|||
pageLength=5, |
|||
lengthMenu=c(3,5,10) )) |
|||
|
|||
``` |
|||
|
|||
|
|||
|
|||
#### Error by template |
|||
|
|||
|
|||
```{r echo=FALSE, message=FALSE, warning=FALSE} |
|||
errfiles<-lapply(errfilepath, read.csv) |
|||
datatable(do.call(ltodf, errfiles), extensions = "Buttons", |
|||
options = list(paging = TRUE, |
|||
scrollX=TRUE, |
|||
searching = TRUE, |
|||
ordering = TRUE, |
|||
dom = 'Bfrtip', |
|||
buttons = c('copy', 'csv', 'excel', 'pdf'), |
|||
pageLength=5, |
|||
lengthMenu=c(3,5,10) )) |
|||
``` |
|||
|
|||
File diff suppressed because one or more lines are too long
Binary file not shown.
@ -0,0 +1,374 @@ |
|||
--- |
|||
title: "Support" |
|||
author: "Scary Scarecrow" |
|||
date: "1/12/2022" |
|||
output: html_document |
|||
--- |
|||
|
|||
|
|||
```{r setup, include=FALSE} |
|||
knitr::opts_chunk$set(echo = TRUE) |
|||
library(readxl) |
|||
library(dplyr) |
|||
library(lubridate) |
|||
library(DT) |
|||
library(tidyr) |
|||
|
|||
# Create an empty data frame for every sheet named in the SAP template |
|||
mutlstxlrdr<-function(){ |
|||
for( i in seq_along(snames)){ |
|||
colnames<-unique(saptemplate[saptemplate$`Sheet Name`==snames[i],]$Header) |
|||
df<-read.table(text = "", col.names = colnames) |
|||
assign(snames[i], df, envir = .GlobalEnv) |
|||
} |
|||
} |
|||
``` |
|||
|
|||
|
|||
## Data transformation workflow |
|||
|
|||
Following is the proposed preliminary workflow for the data transformation project. |
|||
|
|||
|
|||
>All files of a segment (support, accounts, etc.) should be inside the relevant folder. Each segment folder should contain one folder holding all codelist files. All legacy data (one file per country, named after the country) should be inside the raw-data folder. A further file with the field definitions, including the name of the matching column in the legacy file, should also be there. |
|||
|
|||
>*Make sure that there are no hidden files inside the directory.* |
|||
|
|||
|
|||
### Code Lists |
|||
|
|||
|
|||
```{r Create List of Files, echo=TRUE, message=FALSE, warning=FALSE} |
|||
|
|||
filenames <- list.files("./support/CodeList", pattern="*.xlsx", full.names = T) # A separate directory for code lists could be avoided, but organizing the files may then be difficult. This can be explored further if we want to transform all the data in one go, i.e. not by function (support, accounts etc.). |
|||
|
|||
# File paths |
|||
print(filenames) |
|||
``` |
|||
|
|||
|
|||
Check manually that the above list includes all the codelist files. |
|||
If it is correct, read the files. |
|||
|
|||
```{r codelistreader, echo=TRUE, message=FALSE, warning=FALSE} |
|||
sheet_names<-lapply(filenames, excel_sheets) # Creates a list of the sheet names |
|||
codelist_files<-NULL |
|||
for(i in seq_along(filenames)){ |
|||
a<-lapply(excel_sheets(filenames[[i]]), read_excel, path = filenames[[i]], col_types = "text") # Reads the sheets of the excel files |
|||
names(a)<-c(sheet_names[[i]]) # Renames them according to the sheet names extracted above |
|||
codelist_files<-c(codelist_files,a) |
|||
} |
|||
# Names of the files imported |
|||
names(codelist_files) |
|||
#codelist_files<-unique(codelist_files) |
|||
codelist_files$Academic_Title |
|||
``` |
|||
|
|||
|
|||
|
|||
### Templates |
|||
|
|||
|
|||
Let us now extract the data. Below we read the legacy files in the raw-data folder, which hold all data related to `support`. |
|||
|
|||
```{r readlegacyfilepath, echo=TRUE, message=FALSE, warning=FALSE} |
|||
oldfilepath<-list.files("./support/raw-data", pattern="*.xls", full.names = T) # Change the path, check pattern |
|||
print(oldfilepath) |
|||
``` |
|||
|
|||
Manually check that the list matches the actual files. |
|||
|
|||
```{r readlegacyfiles, echo=TRUE} |
|||
|
|||
old_files<-list() |
|||
|
|||
#read_excel(path = oldfilepath[[i]], sheet = 1) |
|||
for(i in seq_along(oldfilepath)){ |
|||
old_files[[i]]<-read_excel(path = oldfilepath[[i]], sheet = 1) |
|||
} |
|||
|
|||
names(old_files)<-basename(oldfilepath) |
|||
``` |
|||
|
|||
|
|||
*Some errors were noticed in the legacy file: columns with similar or identical names exist.* |
|||
|
|||
|
|||
|
|||
```{r readSAPtemplate, echo=TRUE, message=FALSE, warning=FALSE} |
|||
saptemplate<-read_excel("./support/template.xlsx", sheet = "Field_Definitions") |
|||
# First few rows of the imported data |
|||
head(saptemplate) |
|||
``` |
|||
|
|||
|
|||
*Please note that the format of the tables (sheets) has been changed slightly. Earlier, the corresponding sheet name was given in a row before the actual table. Now every row carries the corresponding sheet name. This was done manually to simplify data extraction.* |
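The template also drives the length check later in the workflow through its `Max Length` column. A minimal sketch of that truncation step in base R, assuming a made-up column and limit (`values` and `max_length` are hypothetical):

```r
# Hypothetical column values and the maximum length allowed by the template
values     <- c("Main Street 12", "Short", NA, "")
max_length <- 10

# Flag values longer than the limit; NA and empty strings never exceed it
lengths_found <- nchar(values)
too_long      <- !is.na(lengths_found) & lengths_found > max_length

# Truncate only the offending values, keeping NA as NA
truncated <- ifelse(too_long, substring(values, 1, max_length), values)

print(sum(too_long))
print(truncated)
```

Counting the flagged values before truncating them gives the number that would be reported as length errors.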
|||
|
|||
|
|||
|
|||
```{r createmptySAPfiles, echo=TRUE, message=FALSE, warning=FALSE} |
|||
#orilo<-"en_US.UTF-8" |
|||
#Sys.setlocale(locale="en_US.UTF-8") |
|||
strt<-Sys.time() |
|||
snames <- unique(saptemplate$`Sheet Name`) |
|||
|
|||
for (h in seq_along(old_files)) { |
|||
|
|||
# Copy original data |
|||
old.copy <- old_files[[h]] |
|||
print(paste0(names(old_files[h])," imported")) |
|||
err.summ<-data.frame(Country=NULL, Name=NULL, Expected=NULL, Actual=NULL) #Error Cal |
|||
# Creates data frame for each sheet in snames |
|||
for (i in seq_along(snames)) { |
|||
print(paste0("Processing ..",snames[i])) |
|||
|
|||
# Select the column names from the field description sheet |
|||
print("Creating template") |
|||
sel.template.desc <- |
|||
saptemplate[saptemplate$`Sheet Name` == snames[i], ] |
|||
print("Creating column names") |
|||
sel.template.desc.colnames <- sel.template.desc$Header |
|||
|
|||
# Create a list by adding values from corresponding legacy data |
|||
temp <- list() |
|||
print("adding values to template ") |
|||
for (j in seq_along(sel.template.desc.colnames)) { |
|||
# Use the mapped legacy column, or a single NA when no legacy column is mapped |
|||
oldkey <- sel.template.desc$oldkey[j] |
|||
temp[[j]] <- if (is.na(oldkey) || oldkey == "NA") NA else old.copy[, oldkey, drop = TRUE] |
|||
|
|||
} |
|||
|
|||
# Rename the columns according to field description |
|||
print("renaming template ") |
|||
names(temp) <- sel.template.desc.colnames |
|||
|
|||
# Create data frame from the list |
|||
df <- as.data.frame(temp) |
|||
print("Converted to data frame") |
|||
|
|||
# Error summary file |
|||
Expected<-nrow(df) |
|||
|
|||
#Select essential rows |
|||
print("Identifying essential rows") |
|||
sel.template.desc |> |
|||
filter(Mandatory == "Yes") |> |
|||
pull(Header) -> essential.columns |
|||
|
|||
error.mandatory <- NULL |
|||
error.df<-data.frame(Country=NULL, Name=NULL, Rows=NULL, Expected=NULL) |
|||
# Operate on essential columns including creation of error file |
|||
for (k in seq_along(essential.columns)) { |
|||
|
|||
if(essential.columns[k]=="International_Version"){ |
|||
print("Found International Version. Adding 0.") |
|||
#stop() |
|||
df$International_Version<-"0" |
|||
} |
|||
|
|||
print("Creating and writing data with missing mandatory values") |
|||
assign( |
|||
paste0( |
|||
"error_mandatory_", |
|||
substr(names(old_files[h]), 1, 2), |
|||
"_", |
|||
snames[i], |
|||
"_", |
|||
essential.columns[k] |
|||
), |
|||
df[is.na(df[, essential.columns[k]]), ] |
|||
) |
|||
# TO be saved in error files |
|||
|
|||
if(nrow(df[is.na(df[, essential.columns[k]]), ])>0){ |
|||
write.csv( |
|||
df[is.na(df[, essential.columns[k]]), ], |
|||
paste0( |
|||
"./support/errors/mandatory/", |
|||
substr(names(old_files[h]), 1, 2), |
|||
"_", |
|||
snames[i], |
|||
"_", |
|||
essential.columns[k], |
|||
"_error_mandatory.csv" |
|||
), row.names = F, na="" |
|||
) |
|||
} |
|||
# Error summary file |
|||
Country<-substr(names(old_files[h]), 1, 2) |
|||
Name<-snames[i] |
|||
err.type<-paste0("Missing ",essential.columns[k]) |
|||
err.count<-nrow(df[is.na(df[, essential.columns[k]]), ]) |
|||
|
|||
|
|||
print("Removing rows with empty essential columns") |
|||
df <- df[!is.na(df[, essential.columns[k]]), ] |
|||
if(err.count>0){ |
|||
error.df<-rbind(error.df,data.frame(Country=Country, Name=Name, err.type=err.type, err.count=err.count)) #Error cal |
|||
} |
|||
} |
|||
|
|||
print("Identifying columns associated with codelists") |
|||
# List of columns that have a codelist |
|||
codelistcols <- sel.template.desc |> |
|||
filter(!is.na(`CodeList File Path`)) |> pull(Header) |
|||
for (k in seq_along(codelistcols)) { |
|||
if(codelistcols[k]=="International_Version"){ |
|||
print("Found International Version. Adding 0.") |
|||
df$International_Version<-"0" |
|||
} |
|||
|
|||
print(paste0("Identifying errors ",codelistcols[k])) |
|||
def.rows <- |
|||
which(!df[, codelistcols[k]] %in% c(pull(codelist_files[codelistcols[k]][[1]], Description), NA)) |
|||
def.n<- df[def.rows, 1] |
|||
def.rows.val <- |
|||
df[!df[, codelistcols[k]] %in% c(pull(codelist_files[codelistcols[k]][[1]], Description), NA), codelistcols[k]] |
|||
def <- data.frame(def.rows, def.n,def.rows.val) |
|||
if(nrow(def)>0){ |
|||
assign(paste0( |
|||
"error_codematch_", |
|||
substr(names(old_files[h]), 1, 2), |
|||
"_", |
|||
snames[i], |
|||
"_", |
|||
codelistcols[k] |
|||
), |
|||
def) # TO be saved |
|||
write.csv( |
|||
def, |
|||
paste0( |
|||
"./support/errors/codelist/", |
|||
substr(names(old_files[h]), 1, 2), |
|||
"_", |
|||
snames[i], |
|||
"_", |
|||
codelistcols[k], |
|||
"_error_codematch_.csv" |
|||
), row.names = F, na="" |
|||
) |
|||
} |
|||
err.type<-paste0("Codelist Mismatch ", codelistcols[k]) #Error cal |
|||
err.count<-nrow(def) #Error cal |
|||
if(err.count>0){ |
|||
error.df<-rbind(error.df,data.frame(Country=Country, Name=Name, err.type=err.type, err.count=err.count)) #Error cal |
|||
} |
|||
|
|||
|
|||
print(paste0("Removing errors ",codelistcols[k])) |
|||
# Removes any mismatch |
|||
df[!df[, codelistcols[k]] %in% c(pull(codelist_files[codelistcols[k]][[1]], Description), NA), codelistcols[k]] <- |
|||
NA |
|||
|
|||
# Matches each column with the corresponding code list and returns the value |
|||
df[, codelistcols[k]] <- |
|||
pull(codelist_files[codelistcols[k]][[1]], 2)[match(pull(df, codelistcols[k]), |
|||
pull(codelist_files[codelistcols[k]][[1]], Description))] |
|||
|
|||
} |
|||
max.length <- as.numeric(sel.template.desc$`Max Length`) |
|||
dtype <- sel.template.desc$`Data Type` |
|||
rowval <- NULL |
|||
ival <- NULL |
|||
rval <- NULL |
|||
lenght.issue.df <- NULL |
|||
# Changing the data class |
|||
for (k in 1:ncol(df)) { |
|||
|
|||
if (dtype[k] == "String") { |
|||
df[, k] <- as.character(pull(df, k)) |
|||
} |
|||
if (dtype[k] == "Boolean") { |
|||
df[, k] <- as.logical(pull(df, k)) |
|||
} |
|||
if (dtype[k] == "DateTime") { |
|||
df[, k] <- lubridate::ymd_hms(pull(df, k)) |
|||
} |
|||
if (dtype[k] == "Time") { |
|||
df[, k] <- lubridate::hms(pull(df, k)) |
|||
|
|||
} # This list will increase and also change based on input date and time formats |
|||
|
|||
|
|||
|
|||
} |
|||
|
|||
print("Rectifying streetname") |
|||
# Street and House Number |
|||
if (any(colnames(df) == "Street")) { |
|||
# Split "Street" into street name and house number and keep the result. |
|||
# Rows without a digit yield NA for both parts. |
|||
df <- tidyr::extract(df, |
|||
"Street", |
|||
c("Streetname", "HouseNumber"), |
|||
"(\\D+)(\\d.*)", |
|||
remove = FALSE) |
|||
df <- df |> |
|||
select(-c("Street", "House_Number")) |> |
|||
rename(Street = Streetname, House_Number = HouseNumber) |> |
|||
select(all_of(sel.template.desc.colnames)) |
|||
} |
|||
|
|||
# Length Rectification |
|||
colclasses <- lapply(df, class) |
|||
print("Rectifying Length") |
|||
for (k in 1:ncol(df)) { |
|||
if (colclasses[[k]] == "character") { |
|||
print("found character column ") |
|||
rowval <- pull(df, 1) |
|||
ival <- ifelse(nchar(pull(df, k))== 0 | is.na(nchar(pull(df, k))),1,nchar(pull(df, k))) |
|||
rval <- max.length[k] |
|||
# rectifying data length |
|||
df[, k] <- |
|||
ifelse(nchar(pull(df, k)) > max.length[k], |
|||
substring(pull(df, k), 1, max.length[k]), |
|||
pull(df, k)) |
|||
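} else { |
|||
# Not a character column: clear the values left over from the previous column |
|||
rowval <- NULL; ival <- NULL; rval <- NULL |
|||
} |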
|||
|
|||
lenght.issue.df <- |
|||
rbind(lenght.issue.df, data.frame(rowval, ival, rval)) |
|||
|
|||
err.type<- paste0("Length error ", colnames(df)[k]) # Error cal |
|||
err.count<- sum(ival>rval, na.rm = T) # Error cal |
|||
if(err.count>0){ |
|||
error.df<-rbind(error.df,data.frame(Country=Country, Name=Name, err.type=err.type, err.count=err.count)) #Error cal |
|||
} |
|||
|
|||
|
|||
} |
|||
|
|||
lenght.issue.df <- dplyr::filter(lenght.issue.df,ival>rval) |
|||
|
|||
|
|||
if(nrow(lenght.issue.df)>0){ |
|||
write.csv(lenght.issue.df, |
|||
paste0( |
|||
"./support/errors/length/", |
|||
substr(names(old_files[h]), 1, 2), |
|||
"_", |
|||
snames[i], |
|||
"_length_error.csv" |
|||
), row.names = F, na="") |
|||
} |
|||
|
|||
assign(snames[i], df) |
|||
write.csv(df,paste0("./support/output/", substr(names(old_files[h]), 1, 2), "_", snames[i],".csv"), row.names = F, na="") |
|||
if(nrow(error.df)>0){ |
|||
write.csv(error.df, paste0("./support/summary/",substr(names(old_files[h]), 1, 2), "_", snames[i],"_error",".csv"), row.names = F, na="") # Error write |
|||
} |
|||
err.summ<-rbind(err.summ,data.frame(Country=Country, Name=Name, Expected=Expected, Actual=nrow(df))) #Error Cal |
|||
|
|||
} |
|||
write.csv(err.summ, |
|||
paste0("./support/summary/" ,substr(names(old_files[h]), 1, 2), "_", snames[i],"_sumerror",".csv"), row.names = F, na="") # Error Write |
|||
} |
|||
|
|||
end<-Sys.time() |
|||
|
|||
end-strt |
|||
|
|||
|
|||
``` |
|||
*The code failed because the Department column appears several times in the data, and on import R renamed the duplicates to Department..xx.* |
|||
*Manually verify if these are the required templates* |
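The street rectification in the workflow splits a combined address into street name and house number at the first digit. The sketch below shows the same idea in base R with `sub()` rather than `tidyr::extract()`. It is an illustration under assumptions: the addresses are made up, the pattern is anchored at both ends, and values without a digit keep their full text as street name, which differs from `extract()`, where a non-matching row yields `NA`.

```r
# Non-digits first, then everything from the first digit onwards
street_pattern <- "^(\\D+)(\\d.*)$"

streets <- c("Bahnhofstrasse 12a", "Rue de la Gare 7", "Marktplatz")

has_number   <- grepl(street_pattern, streets)
# Keep the full value as street name when no house number is present
street_name  <- ifelse(has_number, trimws(sub(street_pattern, "\\1", streets)), streets)
house_number <- ifelse(has_number, sub(street_pattern, "\\2", streets), NA_character_)

data.frame(Street = street_name, House_Number = house_number)
```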
|||
|
|||
Binary file not shown.
File diff suppressed because one or more lines are too long
Binary file not shown.
File diff suppressed because one or more lines are too long
@ -0,0 +1 @@ |
|||
<?xml version="1.0" encoding="utf-8"?><Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types"><Default Extension="xml" ContentType="application/octet-stream" /></Types> |
|||
@ -0,0 +1 @@ |
|||
<?xml version="1.0" encoding="utf-8"?><Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types"><Default Extension="xml" ContentType="application/octet-stream" /></Types> |
|||
Binary file not shown.
Some files were not shown because too many files changed in this diff