[MATLAB Data Processing Basics 2] Creating Folders and Navigating to a Folder - 2 ( dir(), regexp(), strsplit() )

toyprojects 2023. 8. 17. 02:50

To recap the folder layout described in the previous post, "[MATLAB Data Processing Basics 2] Creating Folders and Navigating to a Folder - 1 ( cd(), mkdir() )", before we start programming:

Animal names: Animal1 through Animal5

Day names: Day1 through Day10

Folder path: top level folder\Animal\Day\export data.xlsx (on Windows)

top level folder/Animal/Day/export data.xlsx (on Mac or Linux)

File name: export data.xlsx

In exceptional cases the data may be split across two or more files, such as 'export data-trial1.xlsx' and 'export data-trial2.xlsx'. Human error can also creep into the folder names: a lowercase first letter instead of an uppercase one, a space between the letters and the number, or a plain typo. For example:

Top Level Folder \ Animal1 \ day 1 \ export data.xlsx

Top Level Folder \ animal1 \ Day2 \ export data-trial1.xlsx

Top Level Folder \ animal1 \ Day2 \ export data-trial2.xlsx

Top Level Folder \ Animal 1 \ Day 3 \ export data.xlsx

................................................................

If all you need is to find every file with the '*.xlsx' extension in all subfolders, regardless of what the folders are called, a single dir( ) call is enough.

file = dir('**\*.xlsx');

The dir( ) command returns a struct array with one element per matching '*.xlsx' file. Its fields include the file name (name) and the folder that contains the file (folder).
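As a rough illustration of what dir( ) returns, the sketch below builds a small throwaway folder tree under tempname, lists it recursively, and prints the folder and name fields. The folder and file names are made up for the demonstration, and the recursive '**' wildcard assumes MATLAB R2016b or later. Building the search pattern with fullfile( ) sidesteps the '\' versus '/' difference between operating systems.

```matlab
% Build a tiny throwaway folder tree (all names here are illustrative).
root = tempname;
mkdir(fullfile(root, 'Animal1', 'Day1'));
mkdir(fullfile(root, 'Animal2', 'day 2'));

% Create two empty placeholder files with the .xlsx extension.
fclose(fopen(fullfile(root, 'Animal1', 'Day1', 'export data.xlsx'), 'w'));
fclose(fopen(fullfile(root, 'Animal2', 'day 2', 'export data-trial1.xlsx'), 'w'));

% '**' searches the starting folder and every subfolder below it.
found = dir(fullfile(root, '**', '*.xlsx'));

for k = 1:numel(found)
    % .folder is the containing folder, .name is the file name with extension
    fprintf('%s | %s\n', found(k).folder, found(k).name);
end

rmdir(root, 's');    % remove the throwaway tree again
```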

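Here, however, finding the right files means finding files whose base name is 'export data' inside the specified range of animal folders (upper level) and day folders (lower level). We therefore have to split the path information returned by dir( ) into pieces (this is called parsing) and extract the animal name and the day name from them.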

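Before we start, I added one more condition.

In a ten-day experiment, the same procedure might be repeated a week later and recorded again, so I added subfolders named '1week-treated' and '1week-control'. The same scheme can be extended to two weeks, three weeks, or more, and the hyphen ('-') makes it possible to create folders for additional procedures performed on the same day.

I came up with this extra condition because I thought it would call for a more realistic and more challenging program design.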

1) Parsing the file path - strsplit( )

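The goal of this post is to find the correct Excel files, so we first search all subfolders for every file with the '*.xlsx' extension. As with the dir( ) call above, the result "list_folder" is a struct array with fields such as name and folder, and it has one element per file found.

A for-loop then parses the "folder" field, which holds the path of each file, one element at a time.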

For example, if the file path is path = 'C:\Documents\Animal1\Day1\export data-trial1.xlsx' and the delimiters used for parsing are '\' and '-', parsing yields six pieces:
{'C:'} {'Documents'} {'Animal1'} {'Day1'} {'export data'} {'trial1.xlsx'}

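The command that performs this operation is strsplit( ), which accepts one or more delimiters as input parameters.

Finding the information we want (keywords such as 'Animal' + number or 'Day' + number) directly in the full path can be difficult, so we parse first and then compare the separated pieces of the path against the conditions one by one.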
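The six-piece example can be reproduced with a few lines. This is only a minimal sketch: the variable names are illustrative, and the backslash delimiter is written as '\\', which strsplit( ) documents as its escape sequence for a single backslash.

```matlab
full_path = 'C:\Documents\Animal1\Day1\export data-trial1.xlsx';

% Split on backslash and hyphen. '\\' is strsplit's escape sequence for a
% single backslash character.
pieces = strsplit(full_path, {'\\', '-'});

for k = 1:numel(pieces)
    fprintf('%d: %s\n', k, pieces{k});
end
```

Adding '.' to the delimiter list would additionally separate 'trial1' from 'xlsx', which is what the post does for the file name.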

list_folder = dir('**\*.xlsx');
num_folder = length(list_folder);

for folder_iter = 1:1:num_folder
    
    % A single regular expression has trouble telling 'Animal 1' from
    % 'Animal 10', so first split the folder path and the file name into
    % pieces with strsplit(), using several delimiters
    % ('\', '/', '-', '_' for the path and '-', '_', '.' for the file name),
    % and then compare each piece against the patterns.
    file_split = strsplit(list_folder(folder_iter).name,{'-','_','.'},'CollapseDelimiters',true);
    path_split = strsplit(list_folder(folder_iter).folder,{'\','/','-','_'},'CollapseDelimiters',true);

    % ex ) folder = 'C:\Documents\Animal1\Day1', name = 'export data-trial1.xlsx'
    %      path_split = 1×4 cell array: {'C:'} {'Documents'} {'Animal1'} {'Day1'}
    %      file_split = 1×3 cell array: {'export data'} {'trial1'} {'xlsx'}
end

2) String search conditions - regexpi( )

In the parsing example above, the path path = 'C:\Documents\Animal1\Day1\export data-trial1.xlsx' was split on '\' and '-' into six pieces: {'C:'} {'Documents'} {'Animal1'} {'Day1'} {'export data'} {'trial1.xlsx'}.

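The folder that holds the animal name is a combination of 'Animal' and a number. As a regular expression pattern this is 'Animal\d', meaning a search for the text 'Animal' immediately followed by a digit. To ignore the case of 'Animal', the regexpi( ) command is used here; if case matters, use regexp( ) instead.

If a piece of the path contains a typo, such as 'Animl 1' or 'animai 1', the fully spelled-out pattern 'Animal\d' will never match it. The code therefore uses a looser pattern that combines two conditions: the first three letters of 'animal', 'ani', and the digit expression '\d*', giving 'ani\d*'. Note that '\d*' matches zero or more digits, so this pattern matches any piece that contains 'ani'; the number itself is extracted separately with a second pattern, '\d'.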

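Suppose {'Animal1'} is the input to regexpi( ) and the pattern above is split into its two parts, 'ani' and '\d*', which are tested separately. The pattern 'ani' matches from the first character, so the result is 1. The pattern '\d*' matches the digit 1 at the seventh position, so the result is 7.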

>> regexpi('Animal1','ani')
ans =
     1

>> regexpi('Animal1','\d*')
ans =
     7
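The same calls also cover a two-digit name. The sketch below is only an illustration with made-up variable names: it shows that regexpi( ) returns the start index of every match, that the single-digit pattern '\d' returns one index per digit, and that the case-sensitive regexp( ) finds nothing for the lowercase pattern.

```matlab
name = 'Animal10';

start_idx = regexpi(name, 'ani');         % start index of the match, ignoring case
digit_idx = regexpi(name, '\d');          % one start index per digit character
number    = str2double(name(digit_idx));  % join the digit characters, then convert

no_match  = regexp(name, 'ani');          % case-sensitive: 'Ani' is not 'ani'

fprintf('start = %d, number = %d, case-sensitive match found = %d\n', ...
        start_idx, number, double(~isempty(no_match)));
```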

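The code that uses regexpi( ) is needed not only for the animal name but also for the day name and the file name, so it is written as a separate function. The animal name, the day name, and the 'week' label among the days are all either 'text + number' or 'number + text' combinations, so each search result is stored in "found_info" in the order 'text, number' and returned.

For {'Animal1'}, animal_found_info = {'Animal', 1};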

path_split = strsplit(list_folder(folder_iter).folder,{'\','/','-','_'},'CollapseDelimiters',true);
% ex ) folder = 'C:\Documents\Animal1\Day1', name = 'export data-trial1.xlsx'
%      path_split = 1×4 cell array
%              {'C:'} {'Documents'} {'Animal1'} {'Day1'}
%      found_info = {'Animal',1} or {'Day',1} or {'File','export data-trial1.xlsx'}
    
animal_pattern = {'ani\d*','\d'};           
animal_found_info = matching_func(path_split,animal_pattern,'Animal');
    
function [found_info] = matching_func(str_split,str_pattern,cate,varargin)
    found_info = {cate, NaN};                % initialization
    
    for splitIter = 1:1:length(str_split)
        if regexpi(str_split{splitIter},str_pattern{1})
           % collect every digit in this piece (e.g. '10' in 'Animal10') and convert it
           number = str2double(str_split{splitIter}(regexpi(str_split{splitIter},str_pattern{2})));
           if ~isnan(number)                % str2double returns NaN when no number was parsed
              found_info = {cate, number};                
           end          
        end
    end
end
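One possible variation, which is not the approach used in this post, is to let a single anchored expression capture the number as a token. The pattern '^ani[a-z]*\s*(\d+)$' and the variable names below are assumptions made for this sketch; the idea is that the piece must start with 'ani', may contain further letters and an optional space, and must end in one or more digits.

```matlab
% 'Animl 10' contains both a typo and a space before the number.
pieces = {'C:', 'Documents', 'Animl 10', 'Day1'};

animal_number = NaN;
for k = 1:numel(pieces)
    % 'tokens' returns the text captured by (\d+); 'once' stops at the first match
    tok = regexpi(pieces{k}, '^ani[a-z]*\s*(\d+)$', 'tokens', 'once');
    if ~isempty(tok)
        animal_number = str2double(tok{1});
        break
    end
end
fprintf('animal number: %d\n', animal_number);
```

Because the expression is anchored at both ends, a piece such as 'Documents' or 'Day1' is skipped, and 'Animal 1' and 'Animal 10' are told apart by the captured digits.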

The day folder, however, has more cases to consider than the animal folder. Besides 'text + number' names such as 'Day1', there are 'number + text + text' names such as '1week-treated' and '1week-control'. Matching the day folder is therefore done in three steps.

Suppose path = 'C:\Documents\Animal1\1week-treated\export data-trial1.xlsx'. Because '-' is one of the delimiters, the folder part of the path is parsed into {'C:'} {'Documents'} {'Animal1'} {'1week'} {'treated'}.

1) Match the day folder with the pattern 'da\d*'. Only the first two letters, 'da', are used, to allow for a typo in the last letter of 'day'.

2) If nothing satisfies 1), match with the pattern '\wwee\d*'. It combines a word character (\w, which also matches a digit) with the first three letters of 'week'.

3) Once '1week' has been found, the next step is to find 'treated' or 'control'. For this, the additional patterns 'cont' and 'trea' are defined, and the search for them starts from the position where '1week' was found.

The search result is returned as day_found_info = {'1week', 'Control'}; or {'1week', 'Treated'};

In the ordinary case, a name such as 'Day1' matched by the pattern 'da\d*', the function returns day_found_info = {'Day', 1};
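The three steps can be sketched without the helper function as follows. This is a simplified stand-in, not the post's matching_func: the anchored patterns '^da[a-z]*\s*(\d+)$' and '^(\d+)\s*wee[a-z]*$' are assumptions chosen for the sketch so that an unrelated folder whose name merely contains 'da' (for example 'data') is not mistaken for a day folder.

```matlab
pieces = {'C:', 'Documents', 'Animal1', '1week', 'treated'};
day_info = {'Day', NaN};

% Step 1: an ordinary day folder such as 'Day3' or 'day 3'.
for k = 1:numel(pieces)
    tok = regexpi(pieces{k}, '^da[a-z]*\s*(\d+)$', 'tokens', 'once');
    if ~isempty(tok)
        day_info = {'Day', str2double(tok{1})};
        break
    end
end

% Step 2: no day folder was found, so look for '<number>wee...'.
if isnan(day_info{2})
    for k = 1:numel(pieces)
        tok = regexpi(pieces{k}, '^(\d+)\s*wee[a-z]*$', 'tokens', 'once');
        if isempty(tok)
            continue
        end
        label = [tok{1}, 'week'];
        day_info = {label, NaN};
        % Step 3: search the pieces after the week label for the sub-keyword.
        for j = k+1:numel(pieces)
            if ~isempty(regexpi(pieces{j}, '^cont', 'once'))
                day_info = {label, 'Control'};
                break
            elseif ~isempty(regexpi(pieces{j}, '^trea', 'once'))
                day_info = {label, 'Treated'};
                break
            end
        end
        break
    end
end
disp(day_info);
```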

day_pattern = {'da\d*','\d'};
week_pattern = {'\wwee\d*','\d'};
week_add_pattern = {'cont','trea'};  

day_found_info = matching_func(path_split,day_pattern,'Day');
% if 'Day' keyword was not found, try 'week' keyword
if isnan(day_found_info{2})     
   day_found_info = matching_func(path_split,week_pattern,'week',{'Control','Treated'},week_add_pattern);
end
    
function [found_info] = matching_func(str_split,str_pattern,cate,varargin)
    found_info = {cate, NaN};                % initialization
    
    for splitIter = 1:1:length(str_split)
        if regexpi(str_split{splitIter},str_pattern{1})
           number = str2double(str_split{splitIter}(regexpi(str_split{splitIter},str_pattern{2})));
           if ~isnan(number)                % str2double returns NaN when no number was parsed
              found_info = {cate, number};  
              
              % searching 'week' keyword needs an extra for-loop, because
              % of 'control' & 'treated' sub-keywords
              if strcmpi(cate,'week')
                 week_cate = varargin{1};
                 week_str_pattern = varargin{2};
                  
                 for splitIter2 = splitIter:1:length(str_split)
                     if regexpi(str_split{splitIter2},week_str_pattern{1})  
                        found_info = {[num2str(number),cate], week_cate{1}};     % ex) {'1week','Control'}
                        break
                     elseif regexpi(str_split{splitIter2},week_str_pattern{2})
                        found_info = {[num2str(number),cate], week_cate{2}};     % ex) {'1week','Treated'}
                        break 
                     end
                 end
                 
              end
           end          
        end
    end
end

The Excel file we are ultimately looking for is usually named 'export data.xlsx', but occasionally it is named 'export data-trial1.xlsx'. Three patterns matter here: 'exp', 'xlsx', and 'tr\d*'.

If the file name is 'export data.xlsx', the two patterns 'exp' and 'xlsx' are enough to match it, but for 'export data-trial1.xlsx' all three patterns are needed in order to tell trial1 from trial2. Also note that the delimiters used for parsing the file name include the period ('.'), so the extension 'xlsx' becomes a piece of its own.

First, the piece that contains the first pattern, 'exp', is matched. Next, a match is attempted with the third pattern, 'tr\d*', and finally the search uses the 'xlsx' pattern.

If the file name is 'export data.xlsx', the third pattern will not match. In other words, whether the third pattern matches depends on the file name.

To build the result, a temporary variable "str" accumulates the matched parts as a string and is then handed over to the variable that holds the result. The final result looks like this:

file_found_info = {'File', 'export data.xlsx'}; or

{'File', 'export data-trial1.xlsx'}; {'File', 'export data-trial2.xlsx'};
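As a compact cross-check of the file-name rules, the sketch below normalizes a few sample names with two regular expressions instead of the split-and-loop approach of this post. The sample names, the patterns '^exp.*\.xlsx$' and 'tr[a-z]*(\d+)', and the hard-coded base name 'export data' are all assumptions made for this illustration.

```matlab
names = {'export data.xlsx', 'export data-trial1.xlsx', ...
         'Export Data_trial2.XLSX', 'notes.xlsx'};
canonical = cell(size(names));

for k = 1:numel(names)
    canonical{k} = '';
    % Must start with 'exp' and end with '.xlsx' (case is ignored).
    if isempty(regexpi(names{k}, '^exp.*\.xlsx$', 'once'))
        continue
    end
    % Optional trial number: 'tr', any further letters, then the digits.
    trial_tok = regexpi(names{k}, 'tr[a-z]*(\d+)', 'tokens', 'once');
    if isempty(trial_tok)
        canonical{k} = 'export data.xlsx';
    else
        canonical{k} = ['export data-trial', trial_tok{1}, '.xlsx'];
    end
end

for k = 1:numel(names)
    fprintf('%-26s -> %s\n', names{k}, canonical{k});
end
```

A name that does not start with 'exp', such as 'notes.xlsx', is left empty and can be skipped by the caller.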

file_pattern = {'exp','xlsx','tr\d*'};        
file_add_pattern = {'trial'};        
file_split = strsplit(list_folder(folder_iter).name,{'-','_','.'},'CollapseDelimiters',true);
file_found_info = matching_func(file_split,file_pattern,'File');

function [found_info] = matching_func(str_split,str_pattern,cate,varargin)
    found_info = {cate, NaN};                % initialization
    
    for splitIter = 1:1:length(str_split)
        if regexpi(str_split{splitIter},str_pattern{1})
           str = [str_split{splitIter}];    % str = ['export data']
           for splitIter2 = splitIter:1:length(str_split)
               if regexpi(str_split{splitIter2},str_pattern{3})
                  number = str2double(str_split{splitIter2}(regexpi(str_split{splitIter2},'\d')));
                  str = [str,'-','trial',num2str(number)];  
                  found_info = {cate,str};
               end
                   
               if regexpi(str_split{splitIter2},str_pattern{2})
                  str = [str,'.xlsx'];  
                  found_info = {cate,str};  
               end
                   
           end
        end
    end
end

3) Output of the results

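The names of the Excel files found while searching the folders are now sorted by animal and by day and stored for output.

Each of the 5 animals has Excel files for 12 days (Day1 through Day10, plus 1week-control and 1week-treated), so a MATLAB cell variable is initialized with 12 rows and 5 columns. Each row represents a day and each column represents an animal.

The first column of the cell variable holds the Excel file names of Animal1 from Day1 through 1week-treated, and the first row holds the Day1 Excel file names of all animals.

If two Excel files exist for the same day, such as '-trial1.xlsx' and '-trial2.xlsx', the two names are stored together in a single cell as a 2x1 cell array. "str" is a temporary variable used to build and combine the new string.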
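The bookkeeping can be tried in isolation with a few hand-written entries. The sketch below is a variation rather than the code of this post: it keeps every occupied cell as an N-by-1 cell array of labels, so that a third or fourth file for the same day would be appended in the same way. The entries in hits are hypothetical search results.

```matlab
all_data = cell(12, 5);     % 12 day rows x 5 animal columns

% Hypothetical search hits: {row, column, label}
hits = { 2, 1, 'Animal1-Day2-export data-trial1.xlsx'; ...
         2, 1, 'Animal1-Day2-export data-trial2.xlsx'; ...
         1, 1, 'Animal1-Day1-export data.xlsx' };

for k = 1:size(hits, 1)
    row = hits{k, 1};
    col = hits{k, 2};
    if isempty(all_data{row, col})
        all_data{row, col} = hits(k, 3);                        % start a 1x1 cell
    else
        all_data{row, col} = [all_data{row, col}; hits(k, 3)];  % grow to N-by-1
    end
end

fprintf('Day2 of Animal1 holds %d file(s)\n', numel(all_data{2, 1}));
```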

animal_range = 1:5;         % how many animals
day_range = 1:10;           % how many days(incl. day1-10, 1week-control, 1week-treated)
dtCell_allData = cell(length(day_range)+2,length(animal_range));

str = [animal_found_info{1},num2str(animal_found_info{2}),'-'];
if isa(day_found_info{2},'numeric')
   row = day_found_info{2};
   str = [str,day_found_info{1},num2str(day_found_info{2})];
elseif isa(day_found_info{2},'char')
   if strcmpi(day_found_info{2},'control')
      row = day_range(end)+1;
   elseif strcmpi(day_found_info{2},'treated')
      row = day_range(end)+2;
   end
   str = [str,day_found_info{1},'-',day_found_info{2}];
end
col = animal_found_info{2};
str = [str,'-',file_found_info{2}];

% integrate result of navigation to folder information
if isempty(dtCell_allData{row,col})
   dtCell_allData{row,col} = str;
else                                % case of '-trial1.xlsx' & '-trial2.xlsx'   
   dtCell_allData{row,col} = {dtCell_allData{row,col}; str};
end

Finally, all of the MATLAB code used in this post is listed below.

% parameters initialization
animal_range = 1:5;         % how many animals
day_range = 1:10;           % how many days(incl. day1-10, 1week-control, 1week-treated)
dtCell_allData = cell(length(day_range)+2,length(animal_range));

% Get folder contents in all subfolders

% list_folder: struct type data, field names: 'name','folder', so on.
list_folder = dir('**\*.xlsx');         
num_folder = length(list_folder);

animal_pattern = {'ani\d*','\d'};           
day_pattern = {'da\d*','\d'};
week_pattern = {'\wwee\d*','\d'};

week_add_pattern = {'cont','trea'};  
file_pattern = {'exp','xlsx','tr\d*'};        
file_add_pattern = {'trial'};

% parsing path info. and navigate to folder
for folder_iter = 1:1:num_folder
    
    % A single regular expression has trouble telling 'Animal 1' from
    % 'Animal 10', so first split the folder path and the file name into
    % pieces with strsplit(), using several delimiters
    % ('\', '/', '-', '_' for the path and '-', '_', '.' for the file name),
    % and then compare each piece against the patterns.
    file_split = strsplit(list_folder(folder_iter).name,{'-','_','.'},'CollapseDelimiters',true);
    path_split = strsplit(list_folder(folder_iter).folder,{'\','/','-','_'},'CollapseDelimiters',true);

    % ex ) folder = 'C:\Documents\Animal1\Day1', name = 'export data-trial1.xlsx'
    %      path_split = 1×4 cell array: {'C:'} {'Documents'} {'Animal1'} {'Day1'}
    %      file_split = 1×3 cell array: {'export data'} {'trial1'} {'xlsx'}
    %      found_info = {'Animal',1} or {'Day',1} or {'File','export data-trial1.xlsx'}
    
    file_found_info = matching_func(file_split,file_pattern,'File');
    animal_found_info = matching_func(path_split,animal_pattern,'Animal');
    day_found_info = matching_func(path_split,day_pattern,'Day');
    % if 'Day' keyword was not found, try 'week' keyword
    if isnan(day_found_info{2})     
       day_found_info = matching_func(path_split,week_pattern,'week',{'Control','Treated'},week_add_pattern);
    end
    
    str = [animal_found_info{1},num2str(animal_found_info{2}),'-'];
    if isa(day_found_info{2},'numeric')
       row = day_found_info{2};
       str = [str,day_found_info{1},num2str(day_found_info{2})];
    elseif isa(day_found_info{2},'char')
       if strcmpi(day_found_info{2},'control')
          row = day_range(end)+1;
       elseif strcmpi(day_found_info{2},'treated')
          row = day_range(end)+2;
       end
       str = [str,day_found_info{1},'-',day_found_info{2}];
    end
    col = animal_found_info{2};
    str = [str,'-',file_found_info{2}];

    % integrate result of navigation to folder information
    if isempty(dtCell_allData{row,col})
       dtCell_allData{row,col} = str;
    else                                % case of '-trial1.xlsx' & '-trial2.xlsx'   
        dtCell_allData{row,col} = {dtCell_allData{row,col}; str};
    end
    
end

function [found_info] = matching_func(str_split,str_pattern,cate,varargin)
    % ex)  folder = 'C:\Documents\Animal1\Day1', name = 'export data-trial1.xlsx'
    %      str_split = {'C:'} {'Documents'} {'Animal1'} {'Day1'}   (path pieces), or
    %                  {'export data'} {'trial1'} {'xlsx'}          (file-name pieces)
    %      found_info = {'Animal',1} or {'Day',1} or {'File','export data-trial1.xlsx'}
    
    % if nothing is matched below, found_info{2} stays NaN
    found_info = {cate, NaN};                % initialization
    
    for splitIter = 1:1:length(str_split)
        % category: 'File'
        if strcmpi(cate,'file')             
            if regexpi(str_split{splitIter},str_pattern{1})
               str = [str_split{splitIter}];    % str = ['export data']
               for splitIter2 = splitIter:1:length(str_split)
                   if regexpi(str_split{splitIter2},str_pattern{3})
                      number = str2double(str_split{splitIter2}(regexpi(str_split{splitIter2},'\d')));
                      str = [str,'-','trial',num2str(number)];  
                      found_info = {cate,str};
                   end
                   
                   if regexpi(str_split{splitIter2},str_pattern{2})
                      str = [str,'.xlsx'];  
                      found_info = {cate,str};  
                   end
                   
               end
            end
        
        % category: 'Animal', 'Day'
        elseif regexpi(str_split{splitIter},str_pattern{1})
           number = str2double(str_split{splitIter}(regexpi(str_split{splitIter},str_pattern{2})));
           
           if ~isnan(number)                % str2double returns NaN when no number was parsed
              found_info = {cate, number};     
              
              % searching 'week' keyword needs an extra for-loop, because
              % of 'control' & 'treated' sub-keywords
              if strcmpi(cate,'week')
                 week_cate = varargin{1};
                 week_str_pattern = varargin{2};
                  
                 for splitIter2 = splitIter:1:length(str_split)
                     if regexpi(str_split{splitIter2},week_str_pattern{1})  
                        found_info = {[num2str(number),cate], week_cate{1}};     % ex) {'1week','Control'}
                        break
                     elseif regexpi(str_split{splitIter2},week_str_pattern{2})
                        found_info = {[num2str(number),cate], week_cate{2}};     % ex) {'1week','Treated'}
                        break 
                     end
                 end
                 
              end
              break
              
           end
           
        end
    end
end
